Norconex Importer

Open-source document text extractor and transformer


Content importer

Norconex Importer is a Java library and command-line application designed to "parse" and "extract" the content of a file as plain text, whatever its native format (HTML, PDF, Word, etc.). It also lets you manipulate the extracted text in any way you need before importing or using it in your own service or application.
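The extract-then-transform flow described above can be pictured with a small, self-contained Java sketch. This is only an illustration of the idea and does not use the Norconex Importer API: the class `ExtractThenTransform` and the methods `extractText` and `transform` are hypothetical names, and the tag-stripping regular expression is a naive stand-in for a real format parser.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Illustrative only: hypothetical names, NOT the Norconex Importer API.
public class ExtractThenTransform {

    // Step 1 (extract): naive stand-in for a real parser.
    // Replaces anything that looks like an HTML tag with a space,
    // then collapses runs of whitespace and trims the result.
    public static String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    // Step 2 (transform): apply each text manipulation in order.
    public static String transform(String text, List<UnaryOperator<String>> steps) {
        String result = text;
        for (UnaryOperator<String> step : steps) {
            result = step.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        String raw = "<html><body><h1>Hello</h1><p>Norconex   Importer</p></body></html>";

        // Extract plain text from the markup.
        String text = extractText(raw);

        // Manipulate the extracted text before handing it to another system.
        List<UnaryOperator<String>> steps = List.of(
                s -> s.toLowerCase(),
                s -> s.replace("hello", "greetings"));

        System.out.println(transform(text, steps));
    }
}
```

Keeping extraction and transformation as separate steps means the same list of manipulations can be applied to text coming from any source format.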

A typical use, though not the only one, is to “import” crawled content for use by a search engine. We invite you to consider one of the Norconex Crawlers for this purpose (they rely on Norconex Importer).

Have a look at the supported file formats.

Latest news

Norconex Web Crawler 3.0.0 Released!
2022-01-05
The new major release of Norconex HTTP Collector is finally here. Check out what's new. More...

Norconex HTTP Collector 3.0.0 Release Candidate 1
2021-10-10
Is this the last pre-release? Put it to the test and let us know! Includes applicable release candidates of core dependencies as well. More...

Norconex HTTP Collector 3.0.0 Milestone 2
2021-07-28
3.0.0 second milestone release. Includes applicable milestone releases of core dependencies as well. More...

Norconex HTTP Collector 3.0.0 Milestone 1
2021-03-01
A step closer to final release. Available with milestone releases of core dependencies as well. More...

Norconex HTTP Collector 3.0.0 snapshots available
2020-09-07
Development builds of the upcoming version 3 are now available to experiment with. More...