Norconex Importer

Open-Source document text extractor and transformer


Content importer

Norconex Importer is a Java library and command-line application that parses a file in its native format (HTML, PDF, Word, etc.) and extracts its content as plain text. It also lets you manipulate the extracted text before importing or using it in your own service or application.
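To make the "extract, then transform" idea concrete, here is a minimal, self-contained Java sketch. It does not use the Norconex Importer API: the class and method names (`ImportSketch`, `extractText`, `importAsText`) are hypothetical, and a crude regex tag-stripper stands in for real format parsing, which in practice covers far more than HTML.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Illustrative only: hypothetical names, NOT the Norconex Importer API. */
public class ImportSketch {

    /** Crude stand-in for a real parser: strips tags from an HTML string. */
    public static String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ")  // replace each tag with a space
                   .replaceAll("\\s+", " ")     // collapse runs of whitespace
                   .trim();
    }

    /** Extracts plain text, then applies each transformation in order. */
    public static String importAsText(
            String html, List<UnaryOperator<String>> transformers) {
        String text = extractText(html);
        for (UnaryOperator<String> transformer : transformers) {
            text = transformer.apply(text);
        }
        return text;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1>"
                + "<p>Some <b>bold</b> text.</p></body></html>";
        List<UnaryOperator<String>> steps = List.of(
                s -> s.replace("bold", "plain"),
                s -> s.toUpperCase(Locale.ROOT));
        // prints "TITLE SOME PLAIN TEXT."
        System.out.println(importAsText(html, steps));
    }
}
```

The transformations are passed as an ordered list so that each step sees the output of the previous one, which mirrors the idea of manipulating extracted text before handing it to your own application.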

A typical use, though not the only one, is to "import" crawled content for use by a search engine. For that purpose, consider one of the Norconex Collectors, which rely on Norconex Importer.

Have a look at the supported file formats.

Latest news

Norconex HTTP Collector 3.0.0 snapshots available
2020-09-07
Development builds of the upcoming version 3 are now available to experiment with.

opensource.norconex.com
2020-09-07
All Norconex open-source projects are now grouped under the same domain.

Norconex HTTP and FileSystem Collectors 2.9.0 released
2019-12-22
New URL normalization rules, support for CMIS protocol, ACL extraction from more sources, etc.

Norconex Collector Core 1.10.0 released
2019-12-22
Unmanaged logs, max parallel crawlers, etc.

Norconex Importer 2.10.0 released
2019-12-22
New FieldReportTagger, Tika upgrade, fixes, etc.