Norconex Importer

Open-Source document text extractor and transformer

Content importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a computer file as plain text, whatever its native format (HTML, PDF, Word, etc.). It also lets you manipulate the extracted text in any way you need before importing or using it in your own service or application.

A typical use, though not the only one, is to "import" crawled content for use by a search engine. For this purpose, we invite you to consider one of the Norconex Crawlers, which rely on Norconex Importer.

Have a look at the supported file formats.

Latest news

Norconex Google Cloud Search Committer in v3 Stack

2026-07-01

Google Cloud Search Committer now has a Norconex-managed 3.x release line in the v3 stack. The legacy 2.x line developed and hosted by Google remains available.

Norconex Filesystem Collector joins v3 Stack

2026-07-01

File System Crawler now has a Norconex-managed 3.x release line in the synchronized v3 stack, while the legacy 2.x line remains available.

Norconex v3 Stack Coordinates Update

2026-07-01

The v3 train now uses synchronized 3.2.0-SNAPSHOT versions with the new com.norconex.collectors.v3 groupId and Maven path strategy.

Norconex Web Crawler 3.1.0 Released

2025-05-24

Additional options for the WebDriver fetcher, bug fixes, and other changes.