Norconex Importer

3.x Release Notes

Release History

Version Date Description
3.0.1 2023-07-09
3.0.0 2022-01-02 Major release. NOT a drop-in replacement for 2.x.

3.0.1 Release date 2023-07-09

New New DOMPreserveTransformer. #76
Updated Maven dependency updates: norconex-commons-maven-parent 1.0.2-SNAPSHOT.
Fixed Fix RegexTagger not picking up XML-configured "fieldMatcher".

3.0.0 Major release. NOT a drop-in replacement for 2.x. Release date 2022-01-02

Updated Updated transitive dependencies with known vulnerabilities.
Updated Updated dependencies to avoid logging library detection conflict.
Updated Maven dependency updates: Apache Tika 1.27 (and its many transitive dependencies), UCAR jj2000 5.4, Opencsv 5.5.2, JAI Image-IO jpeg2000 1.4.0, JBIG2 ImageIO 2.0.
Fixed Fixed invalid configuration in POM "maven-dependency-plugin".
New Handlers now support XML "flow", which adds support for if/ifNot/condition/then/else tags in XML configuration.
New New "condition" classes for XML "flow" configuration: BlankCondition, DateCondition, DOMCondition, NumericCondition, ReferenceCondition, ScriptCondition, and TextCondition.
New New RejectFilter.
New New CharsetUtil#firstNonBlankOrUTF8(...) methods.
New When not already set, an attempt to detect document character encoding is now always made before invoking handlers.
New New CommonMatchers class.
New New ImageTransformer class.
New New NoContentTransformer class.
New New -f or "outputMetaFormat" command-line argument for saving exported metadata fields in alternate formats.
New New TextFilter class.
New New ReferenceFilter class.
New New ExternalHandler class.
New New DOMFilter class.
New New EmptyFilter class.
New New RegexTagger class.
New New URLExtractorTagger class.
New New DOMDeleteTransformer class.
New New XMLStreamSplitter class.
New New HandlerDoc to ease handler implementations.
New Importer now uses an EventManager and triggers several events: IMPORTER_HANDLER_BEGIN, IMPORTER_HANDLER_END, IMPORTER_HANDLER_ERROR, IMPORTER_PARSER_BEGIN, IMPORTER_PARSER_END, IMPORTER_PARSER_ERROR.
New New ImporterDocument#getStreamFactory() method.
New ReplaceTagger now has the option to discard values that are unchanged after replacement.
New New options on CharacterCaseTagger: "wordsFully", "stringFully", "sentences", and "sentencesFully".
New Most configurable classes adding/setting metadata values now have an extra "onSet" option for dictating how values are set: append, prepend, replace, optional.
New New DocInfo class.
New New ImporterRequest class.
New New option in DOMTagger to delete elements matched by a selector.
New Added time zone support to DateMetadataFilter.
New Added support for Webp image format.
Updated Now requires Java 8 or higher.
Updated Importer#importDocument(...) now expects an ImporterRequest or a Doc.
Updated Default allocated memory for caching of document content was increased by a factor of 10 (100MB max per document, 1GB max total).
Updated XML tag names for handlers were changed from "filter", "tagger", "transformer", and "splitter" to simply "handler".
Updated JBIG2 image support now included under Apache license.
Updated Logging now using SLF4J.
Updated Maven dependency updates: Norconex Commons Lang 2.0.0, Apache Tika 1.22, Apache Commons CLI 1.4, Junit 5.
Updated RegexFieldExtractor and RegexUtil have been deprecated in favor of Norconex Commons Lang FieldValueExtractor and Regex.
Updated RegexContentFilter and RegexMetadataFilter have been deprecated in favor of TextFilter.
Updated RegexReferenceFilter has been deprecated in favor of ReferenceFilter.
Updated DOMContentFilter has been deprecated in favor of DOMFilter.
Updated EmptyMetadataFilter has been deprecated in favor of EmptyFilter.
Updated TextPatternTagger has been deprecated in favor of RegexTagger.
Updated TextBetweenTagger now has "inclusive" and "caseSensitive" options configurable for each "between" detail.
Updated Now using Path instead of File in many cases.
Updated Parsing no longer attempted on zero-length content.
Updated List of PropertyMatcher replaced with PropertyMatchers.
Updated ContentTypeDetector methods are now static.
Updated Eliminated Apache Tika log warnings on startup when specific optional libraries are missing because they are not packaged due to licensing (e.g. JPEG 2000, jbig2).
Updated Occurrences of accessors for overwrite="[false|true]" and onConflict="..." have been deprecated in favor of new onSet="...".
Updated Most places where regular expressions could be used now also support "basic" matching and "wildcard" as well as being able to ignore diacritical marks (e.g., accents).
Updated Most occurrences of "caseSensitive" or "caseInsensitive" configuration options are now replaced with "ignoreCase".
Updated Filters implementing AbstractStringFilter will now have their isStringContentMatching(...) method invoked at least once, even if there is no document content.
Updated "parsed" boolean arguments were replaced by ParseState.PRE and ParseState.POST.
Updated Many methods with a combination of reference, input stream, and metadata were updated to now accept a Doc instance instead.
Removed Removed some of the methods deprecated in previous releases.
Removed Removed SplittableDocument.
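
The "onSet" option introduced in 3.0.0 (append, prepend, replace, optional) may be easier to picture with a small example. The following is a minimal, self-contained sketch of how the four strategies could behave on a multi-valued metadata field. It does not use the Importer API: the OnSetSketch class, the OnSet enum, and the apply(...) method are hypothetical names made up for illustration. The sketch assumes that "optional" only sets values when the field has none yet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of the Importer API. */
final class OnSetSketch {

    /** The four strategies named in the release notes. */
    enum OnSet { APPEND, PREPEND, REPLACE, OPTIONAL }

    /**
     * Applies new values to a metadata field using the given strategy.
     * Assumed semantics: OPTIONAL sets values only when the field
     * currently holds none.
     */
    static void apply(Map<String, List<String>> metadata, OnSet onSet,
            String field, String... values) {
        List<String> current =
                metadata.computeIfAbsent(field, k -> new ArrayList<>());
        List<String> incoming = Arrays.asList(values);
        switch (onSet) {
            case APPEND:
                // Add after any existing values.
                current.addAll(incoming);
                break;
            case PREPEND:
                // Insert before any existing values.
                current.addAll(0, incoming);
                break;
            case REPLACE:
                // Discard existing values, keep only the new ones.
                current.clear();
                current.addAll(incoming);
                break;
            case OPTIONAL:
                // Keep existing values untouched when there are any.
                if (current.isEmpty()) {
                    current.addAll(incoming);
                }
                break;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        apply(metadata, OnSet.APPEND, "title", "a");
        apply(metadata, OnSet.APPEND, "title", "b");
        apply(metadata, OnSet.PREPEND, "title", "c");
        apply(metadata, OnSet.OPTIONAL, "title", "d"); // ignored: field has values
        System.out.println(metadata.get("title"));     // prints [c, a, b]
        apply(metadata, OnSet.REPLACE, "title", "e");
        System.out.println(metadata.get("title"));     // prints [e]
    }
}
```

In an actual configuration, the strategy is selected with the onSet="..." attribute on handlers that support it, which supersedes the deprecated overwrite and onConflict settings listed above.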