Norconex Importer

3.x Release Notes

Release History

Version Date Description
3.2.0-SNAPSHOT 2026-??-?? Minor release.
3.1.0 2025-05-24 Minor release.
3.0.1 2023-07-09
3.0.0 2022-01-02 Major release. NOT a drop-in replacement for 2.x.

3.2.0-SNAPSHOT Minor release. Release date 2026-??-??

This release is currently in development and the following information may change.
New New GrobidConfig class for configuring optional Grobid REST service integration (disabled by default). When enabled, Tika's JournalParser and GrobidRESTParser are used to extract metadata from scientific documents (e.g., PDF journal articles). Configurable via GenericDocumentParserFactory XML with a new "grobid" element.
New New AbstractTikaParser.MetadataFieldPolicy enum (PRESERVE, LEGACY, BOTH) to control how Dublin Core "dc:" prefixed metadata field names are mapped to Importer metadata. Default is BOTH, which stores values under both the original "dc:" name and the legacy unprefixed name.
New Added local SLF4J Logger/LoggerBridge shims for PDFBox JBIG2 to silence runtime warnings caused by missing logging classes in the updated PDFBox/JBIG2 dependency chain.
Updated Minimum Java Version is now 17.
Updated Apache Tika upgraded from 1.27 to 3.2.3. The monolithic "tika-parsers" artifact is replaced by the modular Tika 3.x packages: tika-parsers-standard-package, tika-parser-advancedmedia-package, tika-parser-nlp-package, tika-parser-scientific-package, and tika-parser-sqlite3-package.
Updated norconex-commons-maven-parent upgraded from 1.1.0 to 1.2.0.
Updated Dependency updates: commons-compress 1.21 -> 1.28.0, commons-io 2.21.0 (new explicit dependency), jsoup 1.15.3 -> 1.21.1, opencsv 5.5.2 -> 5.12.0, Apache POI 5.5.1 (poi, poi-ooxml-full, xmlbeans 5.3.0) now explicit dependencies, jbig2-imageio 3.0.4 added.
Updated Added nashorn-core 15.4 as an explicit dependency since the Nashorn JavaScript engine is no longer bundled with the JDK in Java 17+.
Updated ContentTypeDetector overhauled to support Tika 3.x: now uses a DefaultDetector with custom MIME types loaded explicitly. Added deep MS Office content-type disambiguation via OLE header and OOXML ZIP entry inspection, correctly resolving generic "application/x-tika-msoffice" and "application/x-tika-ooxml" to their specific types (Word, Excel, PowerPoint, Visio, etc.) using both file extension hints and binary content sniffing.
Updated AbstractTikaParser updated for Tika 3.x: TikaMetadataKeys replaced by TikaCoreProperties; OCR configuration now explicitly sets TesseractOCRParser in the ParseContext; improved DocumentParserException message to include the document reference; Grobid parser chain is disabled by default by removing JournalParser/GrobidRESTParser from the AutoDetectParser's internal list via reflection.
Updated GenericDocumentParserFactory: Tika startup warning-suppression (for TesseractOCRParser and SQLite3Parser) now uses Class.forName() and reflection to avoid hard compile-time dependency on classes that may not be present, preventing ClassNotFoundException at startup.
Updated LanguageTagger: updated Tika 3.x import for OptimaizeLangDetector (now under org.apache.tika.langdetect.optimaize package).
Updated PDFPageSplitter: updated to use PDFBox 3.x API (Loader.loadPDF(byte[]) instead of the removed PDDocument.load(InputStream)).
Updated FallbackParser: improved tika-config.xml loading with fallback to classloader resource lookup and ultimately TikaConfig.getDefaultConfig() when no custom config file is found on the classpath.
Updated custom-mimetypes.xml: added Tika XML namespace declaration, fixed XFDL root-XML localName case (XFDL), and removed invalid DOCTYPE.
Updated Maven Surefire plugin configured with --add-opens for Java module system compatibility.
Updated Maven central publishing switched from nexus-staging-maven-plugin to central-publishing-maven-plugin (Sonatype Central Portal).
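
The MetadataFieldPolicy entry above names three modes for mapping Dublin Core "dc:" fields. The following is a minimal, standalone Java sketch of how such a policy could resolve target field names; it is not the Importer's implementation. The class DcFieldPolicySketch and its methods are hypothetical, and the behavior shown for PRESERVE (prefixed name only) and LEGACY (unprefixed name only) is an assumption inferred from the constant names. Only the BOTH behavior is stated in the notes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only; hypothetical class, not part of the Importer API. */
final class DcFieldPolicySketch {

    /** Reuses the three constant names listed in the release note. */
    enum Policy { PRESERVE, LEGACY, BOTH }

    private static final String DC_PREFIX = "dc:";

    /**
     * Returns the field name(s) a value would be stored under.
     * Assumption: PRESERVE keeps only the prefixed name and LEGACY keeps
     * only the unprefixed one. BOTH (keep both names) is from the note.
     */
    static List<String> targetNames(String name, Policy policy) {
        if (!name.startsWith(DC_PREFIX)) {
            return List.of(name); // non-"dc:" names pass through unchanged
        }
        String legacy = name.substring(DC_PREFIX.length());
        switch (policy) {
            case PRESERVE: return List.of(name);
            case LEGACY:   return List.of(legacy);
            default:       return List.of(name, legacy);
        }
    }

    /** Applies the policy to every entry of a simple single-valued map. */
    static Map<String, String> apply(Map<String, String> src, Policy policy) {
        Map<String, String> out = new LinkedHashMap<>();
        src.forEach((k, v) -> targetNames(k, policy).forEach(n -> out.put(n, v)));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("dc:title", "Sample");
        meta.put("Content-Type", "application/pdf");
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> " + apply(meta, p));
        }
    }
}
```

Under these assumptions, a configuration that previously read the unprefixed "title" field would keep working with LEGACY or BOTH, while PRESERVE would require reading "dc:title" instead.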

3.1.0 Minor release. Release date 2025-05-24

Updated Minimum Java Version is now 11.
Updated Dependency updates.

3.0.1 Release date 2023-07-09

New New DOMPreserveTransformer. #76
Updated Maven dependency updates: norconex-commons-maven-parent 1.0.2.
Fixed Fix RegexTagger not picking up XML-configured "fieldMatcher".

3.0.0 Major release. NOT a drop-in replacement for 2.x. Release date 2022-01-02

Updated Updated transitive dependencies with known vulnerabilities.
Updated Updated dependencies to avoid logging library detection conflict.
Updated Maven dependency updates: Apache Tika 1.27 (and its many transitive dependencies), UCAR jj2000 5.4, Opencsv 5.5.2, JAI Image-IO jpeg2000 1.4.0, JBIG2 ImageIO 2.0.
Fixed Fixed invalid configuration in POM "maven-dependency-plugin".
New Handlers now support XML "flow", which adds support for if/ifNot/condition/then/else tags in XML configuration.
New New "condition" classes for XML "flow" configuration: BlankCondition, DateCondition, DOMCondition, NumericCondition, ReferenceCondition, ScriptCondition, and TextCondition.
New New RejectFilter.
New New CharsetUtil#firstNonBlankOrUTF8(...) methods.
New When not already set, an attempt to detect document character encoding is now always made before invoking handlers.
New New CommonMatchers class.
New New ImageTransformer class.
New New NoContentTransformer class.
New New -f or "outputMetaFormat" command-line argument for saving exported metadata fields in alternate formats.
New New TextFilter class.
New New ReferenceFilter class.
New New ExternalHandler class.
New New DOMFilter class.
New New EmptyFilter class.
New New RegexTagger class.
New New URLExtractorTagger class.
New New DOMDeleteTransformer class.
New New XMLStreamSplitter class.
New New HandlerDoc to ease handler implementations.
New Importer now uses an EventManager and triggers several events: IMPORTER_HANDLER_BEGIN, IMPORTER_HANDLER_END, IMPORTER_HANDLER_ERROR, IMPORTER_PARSER_BEGIN, IMPORTER_PARSER_END, and IMPORTER_PARSER_ERROR.
New New ImporterDocument#getStreamFactory() method.
New ReplaceTagger now has the option to discard values that are unchanged after replacement.
New New options on CharacterCaseTagger: "wordsFully", "stringFully", "sentences", and "sentencesFully".
New Most configurable classes adding/setting metadata values now have an extra "onSet" option for dictating how values are set: append, prepend, replace, optional.
New New DocInfo class.
New New ImporterRequest class.
New New option in DOMTagger to delete elements matched by a selector.
New Added time zone support to DateMetadataFilter.
New Added support for the WebP image format.
Updated Now requires Java 8 or higher.
Updated Importer#importDocument(...) now expects an ImporterRequest or a Doc.
Updated Default allocated memory for caching of document content was increased by a factor of 10 (100MB max per document, 1GB max total).
Updated Handler XML tag names in configuration changed from "filter", "tagger", "transformer", and "splitter" to simply "handler".
Updated JBIG2 image support is now included under the Apache license.
Updated Logging now using SLF4J.
Updated Maven dependency updates: Norconex Commons Lang 2.0.0, Apache Tika 1.22, Apache Commons CLI 1.4, JUnit 5.
Updated RegexFieldExtractor and RegexUtil have been deprecated in favor of Norconex Commons Lang FieldValueExtractor and Regex.
Updated RegexContentFilter and RegexMetadataFilter have been deprecated in favor of TextFilter.
Updated RegexReferenceFilter has been deprecated in favor of ReferenceFilter.
Updated DOMContentFilter has been deprecated in favor of DOMFilter.
Updated EmptyMetadataFilter has been deprecated in favor of EmptyFilter.
Updated TextPatternTagger has been deprecated in favor of RegexTagger.
Updated TextBetweenTagger now has "inclusive" and "caseSensitive" options configurable for each "between" detail.
Updated Now using Path instead of File in many cases.
Updated Parsing no longer attempted on zero-length content.
Updated List of PropertyMatcher replaced with PropertyMatchers.
Updated ContentTypeDetector methods are now static.
Updated Eliminated Apache Tika log warnings on startup when specific optional libraries are missing because they are not packaged due to licensing (e.g., JPEG 2000, JBIG2).
Updated Occurrences of accessors for overwrite="[false|true]" and onConflict="..." have been deprecated in favor of new onSet="...".
Updated Most places where regular expressions could be used now also support "basic" matching and "wildcard" as well as being able to ignore diacritical marks (e.g., accents).
Updated Most occurrences of "caseSensitive" or "caseInsensitive" configuration options are now replaced with "ignoreCase".
Updated Filters implementing AbstractStringFilter will now have their isStringContentMatching(...) method invoked at least once, even if there is no document content.
Updated "parsed" boolean arguments were replaced by ParseState.PRE and ParseState.POST.
Updated Many methods taking a combination of reference, input stream, and metadata arguments now accept a Doc instance instead.
Removed Removed some of the methods deprecated in previous releases.
Removed Removed SplittableDocument.
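
The "onSet" entry above lists four ways of writing a metadata value: append, prepend, replace, and optional. The following is a minimal, standalone Java sketch of what those four modes could mean for a multi-valued field. The class OnSetSketch and its methods are hypothetical and do not reproduce the Importer or Norconex Commons Lang API. In particular, treating "optional" as "set only when the field has no value yet" is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only; hypothetical class, not part of the Importer API. */
final class OnSetSketch {

    /** Names follow the four "onSet" values from the release note. */
    enum OnSet { APPEND, PREPEND, REPLACE, OPTIONAL }

    /** Multi-valued metadata: field name -> ordered list of values. */
    private final Map<String, List<String>> fields = new HashMap<>();

    /** Writes a value to a field according to the given mode. */
    void set(OnSet onSet, String field, String value) {
        List<String> values =
                fields.computeIfAbsent(field, k -> new ArrayList<>());
        switch (onSet) {
            case APPEND:
                values.add(value);      // new value goes last
                break;
            case PREPEND:
                values.add(0, value);   // new value goes first
                break;
            case REPLACE:
                values.clear();         // existing values are discarded
                values.add(value);
                break;
            case OPTIONAL:
                // Assumption: only set when the field has no value yet.
                if (values.isEmpty()) {
                    values.add(value);
                }
                break;
        }
    }

    /** Returns the current values of a field, or an empty list. */
    List<String> get(String field) {
        return fields.getOrDefault(field, List.of());
    }

    public static void main(String[] args) {
        OnSetSketch meta = new OnSetSketch();
        meta.set(OnSet.APPEND, "tag", "a");
        meta.set(OnSet.APPEND, "tag", "b");
        meta.set(OnSet.PREPEND, "tag", "z");
        meta.set(OnSet.OPTIONAL, "tag", "ignored");
        System.out.println(meta.get("tag"));
    }
}
```

Read this way, a single option covers what the deprecated overwrite="[false|true]" and onConflict="..." settings expressed separately.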