Norconex Importer

3.x Release Notes

Release History

Version	Date	Description
3.2.0-SNAPSHOT	2026-??-??	Minor release.
3.1.0	2025-05-24	Minor release.
3.0.1	2023-07-09
3.0.0	2022-01-02	Major release. NOT a drop-in replacement for 2.x.

3.2.0-SNAPSHOT Minor release. Release date 2026-??-??

This release is currently in development and the following information may change.

New	New GrobidConfig class for configuring optional Grobid REST service integration (disabled by default). When enabled, Tika's JournalParser and GrobidRESTParser are used to extract metadata from scientific documents (e.g., PDF journal articles). Configurable via GenericDocumentParserFactory XML with a new "grobid" element.
New	New AbstractTikaParser.MetadataFieldPolicy enum (PRESERVE, LEGACY, BOTH) to control how Dublin Core "dc:" prefixed metadata field names are mapped to Importer metadata. Default is BOTH, which stores values under both the original "dc:" name and the legacy unprefixed name.
New	Added local SLF4J Logger/LoggerBridge shims for PDFBox JBIG2 to silence runtime warnings caused by missing logging classes in the updated PDFBox/JBIG2 dependency chain.
Updated	Minimum Java Version is now 17.
Updated	Apache Tika upgraded from 1.27 to 3.2.3. The monolithic "tika-parsers" artifact is replaced by the modular Tika 3.x packages: tika-parsers-standard-package, tika-parser-advancedmedia-package, tika-parser-nlp-package, tika-parser-scientific-package, and tika-parser-sqlite3-package.
Updated	norconex-commons-maven-parent upgraded from 1.1.0 to 1.2.0.
Updated	Dependency updates: commons-compress 1.21 -> 1.28.0, commons-io 2.21.0 (new explicit dependency), jsoup 1.15.3 -> 1.21.1, opencsv 5.5.2 -> 5.12.0, Apache POI 5.5.1 (poi, poi-ooxml-full, xmlbeans 5.3.0) now explicit dependencies, jbig2-imageio 3.0.4 added.
Updated	Added nashorn-core 15.4 as an explicit dependency since the Nashorn JavaScript engine is no longer bundled with the JDK in Java 17+.
Updated	ContentTypeDetector overhauled to support Tika 3.x: now uses a DefaultDetector with custom MIME types loaded explicitly. Added deep MS Office content-type disambiguation via OLE header and OOXML ZIP entry inspection, correctly resolving generic "application/x-tika-msoffice" and "application/x-tika-ooxml" to their specific types (Word, Excel, PowerPoint, Visio, etc.) using both file extension hints and binary content sniffing.
Updated	AbstractTikaParser updated for Tika 3.x: TikaMetadataKeys replaced by TikaCoreProperties; OCR configuration now explicitly sets TesseractOCRParser in the ParseContext; improved DocumentParserException message to include the document reference; Grobid parser chain is disabled by default by removing JournalParser/GrobidRESTParser from the AutoDetectParser's internal list via reflection.
Updated	GenericDocumentParserFactory: Tika startup warning-suppression (for TesseractOCRParser and SQLite3Parser) now uses Class.forName() and reflection to avoid hard compile-time dependency on classes that may not be present, preventing ClassNotFoundException at startup.
Updated	LanguageTagger: updated Tika 3.x import for OptimaizeLangDetector (now under org.apache.tika.langdetect.optimaize package).
Updated	PDFPageSplitter: updated to use PDFBox 3.x API (Loader.loadPDF(byte[]) instead of the removed PDDocument.load(InputStream)).
Updated	FallbackParser: improved tika-config.xml loading with fallback to classloader resource lookup and ultimately TikaConfig.getDefaultConfig() when no custom config file is found on the classpath.
Updated	custom-mimetypes.xml: added Tika XML namespace declaration, fixed XFDL root-XML localName case (XFDL), and removed invalid DOCTYPE.
Updated	Maven Surefire plugin configured with --add-opens for Java module system compatibility.
Updated	Maven central publishing switched from nexus-staging-maven-plugin to central-publishing-maven-plugin (Sonatype Central Portal).
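
The OOXML part of the ContentTypeDetector change above can be pictured with a small standalone sketch. This is not the Importer's code: the class name OoxmlSniffer and its methods are hypothetical, and the sketch relies only on the convention that Word, Excel, and PowerPoint packages keep their main parts under "word/", "xl/", and "ppt/" inside the ZIP container. OLE header inspection and file extension hints are not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration; not the Importer's ContentTypeDetector.
public class OoxmlSniffer {

    static final String GENERIC = "application/x-tika-ooxml";
    static final String OOXML = "application/vnd.openxmlformats-officedocument.";

    // Returns a specific OOXML media type based on the first recognizable
    // top-level folder in the ZIP container, else the generic type.
    static String sniff(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("word/")) {
                    return OOXML + "wordprocessingml.document";
                }
                if (name.startsWith("xl/")) {
                    return OOXML + "spreadsheetml.sheet";
                }
                if (name.startsWith("ppt/")) {
                    return OOXML + "presentationml.presentation";
                }
            }
        }
        return GENERIC;
    }

    // Builds a tiny in-memory ZIP with empty entries (demo data only).
    static byte[] zipOf(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] docx = zipOf("[Content_Types].xml", "word/document.xml");
        System.out.println(sniff(new ByteArrayInputStream(docx)));
    }
}
```

Returning the generic type when nothing matches keeps the sketch conservative: an unrecognized container is reported as-is rather than guessed.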

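The reflection approach mentioned for GenericDocumentParserFactory in 3.2.0-SNAPSHOT follows a common Java pattern for optional dependencies: a class is looked up by name at runtime instead of being referenced in source. A minimal sketch, assuming a hypothetical helper named OptionalClasses (not part of the Importer API); the Tika class name appears only as a string, so the code compiles and runs without Tika on the classpath.

```java
// Hypothetical helper; not part of the Importer API.
public class OptionalClasses {

    // Returns true if the named class can be loaded, without initializing
    // it and without any compile-time reference to it.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false,
                    OptionalClasses.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Always available in the JDK.
        System.out.println(isPresent("java.lang.String"));
        // Found only when the Tika OCR module is on the classpath.
        System.out.println(isPresent(
                "org.apache.tika.parser.ocr.TesseractOCRParser"));
    }
}
```

Catching LinkageError as well covers the case where the class itself is found but one of its own dependencies is missing.
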
3.1.0 Minor release. Release date 2025-05-24

Updated	Minimum Java Version is now 11.
Updated	Dependency updates.

3.0.1 Release date 2023-07-09

New	New DOMPreserveTransformer.	#76
Updated	Maven dependency updates: norconex-commons-maven-parent 1.0.2.
Fixed	Fix RegexTagger not picking up XML-configured "fieldMatcher".

3.0.0 Major release. NOT a drop-in replacement for 2.x. Release date 2022-01-02

Updated	Updated transitive dependencies with known vulnerabilities.
Updated	Updated dependencies to avoid logging library detection conflict.
Updated	Maven dependency updates: Apache Tika 1.27 (and its many transitive dependencies), UCAR jj2000 5.4, Opencsv 5.5.2, JAI Image-IO jpeg2000 1.4.0, JBIG2 ImageIO 2.0.
Fixed	Fixed invalid configuration in POM "maven-dependency-plugin".
New	Handlers now support XML "flow", which adds support for if/ifNot/condition/then/else tags in XML configuration.
New	New "condition" classes for XML "flow" configuration: BlankCondition, DateCondition, DOMCondition, NumericCondition, ReferenceCondition, ScriptCondition, and TextCondition.
New	New RejectFilter.
New	New CharsetUtil#firstNonBlankOrUTF8(...) methods.
New	When not already set, an attempt to detect document character encoding is now always made before invoking handlers.
New	New CommonMatchers class.
New	New ImageTransformer class.
New	New NoContentTransformer class.
New	New -f or "outputMetaFormat" command-line argument for saving exported metadata fields in alternate formats.
New	New TextFilter class.
New	New ReferenceFilter class.
New	New ExternalHandler class.
New	New DOMFilter class.
New	New EmptyFilter class.
New	New RegexTagger class.
New	New URLExtractorTagger class.
New	New DOMDeleteTransformer class.
New	New XMLStreamSplitter class.
New	New HandlerDoc to ease handler implementations.
New	Importer now uses an EventManager and triggers several events: IMPORTER_HANDLER_BEGIN, IMPORTER_HANDLER_END, IMPORTER_HANDLER_ERROR, IMPORTER_PARSER_BEGIN, IMPORTER_PARSER_END, IMPORTER_PARSER_ERROR.
New	New ImporterDocument#getStreamFactory() method.
New	ReplaceTagger now has the option to discard values that are unchanged after replacement.
New	New options on CharacterCaseTagger: "wordsFully", "stringFully", "sentences", and "sentencesFully".
New	Most configurable classes adding/setting metadata values now have an extra "onSet" option for dictating how values are set: append, prepend, replace, optional.
New	New DocInfo class.
New	New ImporterRequest class.
New	New option in DOMTagger to delete elements matched by a selector.
New	Added time zone support to DateMetadataFilter.
New	Added support for the WebP image format.
Updated	Now requires Java 8 or higher.
Updated	Importer#importDocument(...) now expects an ImporterRequest or a Doc.
Updated	Default allocated memory for caching of document content was increased by a factor of 10 (100MB max per document, 1GB max total).
Updated	XML configuration of handlers had their XML tag names changed from "filter", "tagger", "transformer", "splitter" to simply "handler".
Updated	JBIG2 image support now included under the Apache License.
Updated	Logging now using SLF4J.
Updated	Maven dependency updates: Norconex Commons Lang 2.0.0, Apache Tika 1.22, Apache Commons CLI 1.4, JUnit 5.
Updated	RegexFieldExtractor and RegexUtil have been deprecated in favor of Norconex Commons Lang FieldValueExtractor and Regex.
Updated	RegexContentFilter and RegexMetadataFilter have been deprecated in favor of TextFilter.
Updated	RegexReferenceFilter has been deprecated in favor of ReferenceFilter.
Updated	DOMContentFilter has been deprecated in favor of DOMFilter.
Updated	EmptyMetadataFilter has been deprecated in favor of EmptyFilter.
Updated	TextPatternTagger has been deprecated in favor of RegexTagger.
Updated	TextBetweenTagger now has "inclusive" and "caseSensitive" options configurable for each set of "between" details.
Updated	Now using Path instead of File in many cases.
Updated	Parsing no longer attempted on zero-length content.
Updated	List of PropertyMatcher replaced with PropertyMatchers.
Updated	ContentTypeDetector methods are now static.
Updated	Eliminated Apache Tika log warnings on startup when specific optional libraries are missing because they are not packaged due to licensing (e.g., JPEG 2000, JBIG2).
Updated	Occurrences of accessors for overwrite="[false|true]" and onConflict="..." have been deprecated in favor of new onSet="...".
Updated	Most places where regular expressions could be used now also support "basic" matching and "wildcard" as well as being able to ignore diacritical marks (e.g., accents).
Updated	Most occurrences of "caseSensitive" or "caseInsensitive" configuration options are now replaced with "ignoreCase".
Updated	Filters implementing AbstractStringFilter will now have their isStringContentMatching(...) method invoked at least once, even if there is no document content.
Updated	"parsed" boolean arguments were replaced by ParseState.PRE and ParseState.POST.
Updated	Many methods taking a combination of reference, input stream, and metadata were updated to accept a Doc instance instead.
Removed	Removed some of the methods deprecated in previous releases.
Removed	Removed SplittableDocument.
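
The four "onSet" values listed for 3.0.0 (append, prepend, replace, optional) can be illustrated with a small standalone sketch. The class and enum names below are hypothetical and do not reproduce the Importer's own classes; the reading of "optional" as "set only when the field has no value yet" is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "onSet" semantics; not the Importer's classes.
public class OnSetDemo {

    enum OnSet { APPEND, PREPEND, REPLACE, OPTIONAL }

    // Applies a new value to a multi-valued metadata field.
    static void apply(Map<String, List<String>> meta, OnSet onSet,
            String field, String value) {
        List<String> values =
                meta.computeIfAbsent(field, k -> new ArrayList<>());
        switch (onSet) {
            case APPEND:
                values.add(value);
                break;
            case PREPEND:
                values.add(0, value);
                break;
            case REPLACE:
                values.clear();
                values.add(value);
                break;
            case OPTIONAL:
                // Assumed meaning: set only when the field has no value yet.
                if (values.isEmpty()) {
                    values.add(value);
                }
                break;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> meta = new HashMap<>();
        apply(meta, OnSet.APPEND, "title", "b");
        apply(meta, OnSet.PREPEND, "title", "a");
        apply(meta, OnSet.OPTIONAL, "title", "ignored");
        System.out.println(meta.get("title")); // prints [a, b]
        apply(meta, OnSet.REPLACE, "title", "c");
        System.out.println(meta.get("title")); // prints [c]
    }
}
```

A single option with four values covers what previously took the separate overwrite and onConflict settings.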