Norconex Importer

2.x Release Notes

Release History

Version Date Description
2.11.0 2021-10-18 Feature release
2.10.0 2019-12-22 Feature release
2.9.0 2018-06-17 Feature release
2.8.0 2017-11-26 Feature release
2.7.2 2017-05-26 Bugfix release
2.7.1 2017-05-25 Maintenance release
2.7.0 2017-04-26 Feature release
2.6.1 2016-12-14 Minor release
2.6.0 2016-08-25 Feature release
2.5.2 2016-05-31 Maintenance release
2.5.1 2016-03-22 Bug fix release
2.5.0 2016-02-28 Feature release
2.4.0 2015-11-02 Feature release
2.3.1 2015-08-07 Maintenance release
2.3.0 2015-07-21 Feature release
2.2.0 2015-06-15 Feature release
2.1.1 2015-04-08 Maintenance release
2.1.0 2015-03-31 Feature release
2.0.0 2014-11-25 Major release
1.3.0 2014-08-18 Feature release
1.2.0 2014-03-09 Feature release
1.1.0 2013-08-20 Minor release
1.0.1 2013-08-02 Maintenance release
1.0.0 2013-06-04 Open Source release

2.11.0 Feature release Release date 2021-10-18 Download

New New NoContentTransformer ported from version 3.0.0.
Updated TitleGeneratorTagger now defaults to the first 10,000 characters for analysis to improve performance on large files.
Updated Transformers are now invoked at least once, even when a document has no content.
Fixed ExternalTransformer was improved to avoid concurrency exceptions.

2.10.0 Feature release Release date 2019-12-22 Download

New New FieldReportTagger for reporting the fields discovered during crawling to a file (with sample values).
New HierarchyTagger now has a boolean "regex" attribute to specify whether the separator should match a regular expression.
New RenameTagger now has a boolean "regex" attribute to specify whether the fromField and toField are a regular expression pattern and replacement.
Updated Maven dependency updates: Apache Tika 1.18, Norconex Commons Lang 1.15.1.
Updated HierarchyTagger no longer keeps empty segments by default. A new "keepEmptySegments" attribute has been added to restore the previous behavior.
Updated OCR configuration now expects full path of Tesseract executable (as opposed to installation folder).
Fixed Fixed HierarchyTagger not constructing paths properly. #91
Fixed Fixed ClassCastException when an IDocumentFilter does not implement IOnMatchFilter.
Fixed Fixed LanguageTagger choosing main language as the one with lowest probability. #82
Fixed Upgraded pdfbox to 2.0.11 due to potential security issue.
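
As a rough illustration of the empty-segment behavior described for HierarchyTagger above, the following sketch expands a value into one entry per hierarchy node. It is not the library's implementation: the class and method names (HierarchySketch, expand) are hypothetical, and the exact output format of the real tagger is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the actual HierarchyTagger code.
class HierarchySketch {

    // Expands "a/b/c" into ["a", "a/b", "a/b/c"], one entry per node.
    // When keepEmptySegments is false, empty segments (such as the one
    // in "a//b") are dropped before paths are built.
    static List<String> expand(
            String value, String separator, boolean keepEmptySegments) {
        List<String> nodes = new ArrayList<>();
        StringBuilder path = new StringBuilder();
        // Quote the separator so it is matched literally; the -1 limit
        // preserves trailing empty segments.
        String[] segments = value.split(Pattern.quote(separator), -1);
        boolean first = true;
        for (String segment : segments) {
            if (segment.isEmpty() && !keepEmptySegments) {
                continue;
            }
            if (!first) {
                path.append(separator);
            }
            path.append(segment);
            first = false;
            nodes.add(path.toString());
        }
        return nodes;
    }

    public static void main(String[] args) {
        System.out.println(expand("a//b", "/", false));
        System.out.println(expand("a//b", "/", true));
    }
}
```

With empty segments dropped, "a//b" yields two nodes; with them kept, the empty segment produces an extra intermediate node.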

2.9.0 Feature release Release date 2018-06-17 Download

New New PDFPageSplitter to split PDF pages, treating them as individual documents.
Updated ImporterResponse and ImporterStatus now display nicely in the logs (toString implemented).
Updated Maven dependency updates: Norconex Commons Lang 1.15.0.
Fixed Fixed TitleGeneratorTagger throwing NullPointerException when "fromField" is specified but does not exist (is null). #74
Fixed Fixed "buffer underrun" exception sometimes appearing when parsing some .msg files with embedded files. #72

2.8.0 Feature release Release date 2017-11-26 Download

New New TruncateTagger class.
New New ExternalTagger class. #64
New ExternalTransformer and ExternalParser can now supply/retrieve metadata as files to external applications and can also pass the document reference as argument. New command line tokens: ${INPUT_META} ${OUTPUT_META} ${REFERENCE}. #63
New New configuration option for DOMTagger, DOMSplitter and DOMContentFilter for specifying which parser to use ("html" or "xml").
New TextPatternTagger can now extract field names in addition to field values. #52
New New RegexUtil and RegexFieldExtractor classes.
New TextPatternTagger case sensitivity is now applied to individual patterns.
Updated ReplaceTagger and ReplaceTransformer now support empty/null replacement values, resulting in replacing matches with nothing.
Updated ExternalTransformer and ExternalParser can now specify regex match groups for field names and field values.
Updated Now uses WordPerfect and Quattro Pro parsers contributed to Apache Tika.
Updated Maven dependency updates: Apache Tika 1.16, Norconex Commons Lang 1.14.0.
Fixed Fixed ExternalTransformer and ExternalParser having issues with arguments with spaces in them. #64
Removed Removed copies of Apache Tika classes that are now fixed in Apache Tika: ListTables, ImageParser, ListManager, PDF2XHTML, CharsetDetector.
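
The command line tokens introduced in this release (${INPUT_META}, ${OUTPUT_META}, ${REFERENCE}) can be pictured as placeholders that are replaced before the external application is launched. The sketch below shows plain token substitution only; the class name, method name, command template, and file names are hypothetical, and how the library actually resolves the tokens is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of ${TOKEN} substitution in a command template.
class CommandTokenSketch {

    // Replaces each ${NAME} occurrence with its value. Tokens without a
    // supplied value are left untouched.
    static String resolve(String template, Map<String, String> tokens) {
        String resolved = template;
        for (Map.Entry<String, String> entry : tokens.entrySet()) {
            // String#replace performs a literal (non-regex) replacement.
            resolved = resolved.replace(
                    "${" + entry.getKey() + "}", entry.getValue());
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> tokens = new LinkedHashMap<>();
        tokens.put("INPUT_META", "in.meta");      // assumed file name
        tokens.put("OUTPUT_META", "out.meta");    // assumed file name
        tokens.put("REFERENCE", "http://example.com/doc.html");
        // "myapp" and its flags are made up for this example.
        System.out.println(resolve(
                "myapp -m ${INPUT_META} -o ${OUTPUT_META} -r ${REFERENCE}",
                tokens));
    }
}
```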

2.7.2 Bugfix release Release date 2017-05-26 Download

Fixed Fixed "caseSensitive" flag sometimes having no effect in DOMContentFilter, RegexContentFilter, RegexMetadataFilter, and RegexReferenceFilter.

2.7.1 Maintenance release Release date 2017-05-25 Download

Updated ImporterConfig#saveToXML(...) now written with xml:space="preserve".
Updated Maven dependency updates: Norconex Commons Lang 1.13.1.

2.7.0 Feature release Release date 2017-04-26 Download

New Added Lua scripting support to ScriptFilter, ScriptTagger, and ScriptTransformer.
New New ExternalTransformer for transforming documents and extracting metadata using an external application.
New Added schema-based XML configuration validation, which can be triggered from the command prompt with this new flag: -k or --checkcfg
New New RegexReferenceFilter for filtering documents based on matching references (e.g. URL).
New New MergeTagger for combining multiple fields into one.
New New SubstringTransformer for reducing content (e.g., truncating) to a substring delimited by begin and end indexes.
New New UUIDTagger for adding a random universally unique identifier (UUID) to documents.
New CharacterCaseTagger now supports "swap" and "string" to swap character case and capitalize beginning of a string, respectively.
New New ConstantTagger#setOnConflict(...) method to specify if the constant should be added to existing values, replace them, or do nothing.
New Now distributed with utility scripts.
Updated Dependency updates: Apache Tika 1.14, Norconex Commons Lang 1.13.0, JSoup 1.10.2, OOXML-Schemas 1.3 (fixes some bad Visio parsing), Apache Commons Collections 3.2.2.
Updated ExternalParser was rewritten. Now offers more metadata extraction options and environment variable support.
Updated Modified Javadoc to include an XML usage example for all XML-configurable classes.
Updated Dependent libraries for the JPEG 2000 and JBIG2 image formats are no longer distributed with this product due to licensing incompatibilities. To enable them, you will need the JAR files found at these locations: http://central.maven.org/maven2/com/github/jai-imageio/jai-imageio-jpeg2000/ http://central.maven.org/maven2/com/levigo/jbig2/levigo-jbig2-imageio/
Fixed Fixed NoClassDefFoundError on some MS Visio files: com/microsoft/schemas/office/visio/x2012/main/ConnectsType
Fixed Fixed NullPointerException from parsing some Word documents. #41
Removed Removed FixedHtmlEncodingDetector class in favor of the fixed version of HtmlEncodingDetector. https://issues.apache.org/jira/browse/TIKA-1837
Removed Removed deprecated Importer HTMLParser and PDFParser classes.
Removed Removed deprecated IDocumentSplittableEmbeddedParser interface.
Removed Removed Importer EnhancedPDFParser and EnhancedPDF2XHTML in favor of upgraded TIKA PDFParser and PDF2XHTML versions.
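
The three conflict strategies described for ConstantTagger#setOnConflict(...) in this release (add to existing values, replace them, or do nothing) can be sketched as follows. This is a minimal illustration, not the library's code: the enum constants, class name, and the use of a plain Map for metadata are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of add/replace/no-op conflict handling.
class OnConflictSketch {

    // Assumed names; the real API may use different constants.
    enum OnConflict { ADD, REPLACE, NOOP }

    // Stores a constant in a field. The strategy only matters when the
    // field already holds at least one value.
    static void setConstant(Map<String, List<String>> metadata,
            String field, String constant, OnConflict onConflict) {
        List<String> existing = metadata.get(field);
        if (existing == null || existing.isEmpty()) {
            List<String> values = new ArrayList<>();
            values.add(constant);
            metadata.put(field, values);
            return;
        }
        switch (onConflict) {
            case ADD:
                existing.add(constant);
                break;
            case REPLACE:
                existing.clear();
                existing.add(constant);
                break;
            default:
                // NOOP: keep existing values as they are.
                break;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        metadata.put("source", new ArrayList<>(Arrays.asList("web")));
        setConstant(metadata, "source", "intranet", OnConflict.ADD);
        System.out.println(metadata);
    }
}
```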

2.6.1 Minor release Release date 2016-12-14 Download

New DOMTagger now supports a new flag called "matchBlanks" to extract elements that contain empty values or values made of white spaces only. #39
New ReplaceTagger now supports new flags: "wholeMatch" and "replaceAll".
Updated The default value in DOMTagger can now be an empty string or a string made of white spaces. #39
Updated Dependency updates: Norconex Commons Lang 1.12.3, Joda Time 2.9.4, Apache HTTP Client 4.5.2, Apache HTTP Core 4.4.5, JJ2000 5.3, JAI ImageIO jpeg2000 1.3.1
Fixed Fixed ReplaceTagger not adding replaced value to "toField" when it is the same as original value. #29
Fixed Fixed NoSuchMethodError when performing OCR on some PDFs with JPEG 2000 images in them.
Fixed Fixed "No ImageWriter found for 'jpx' format" when performing OCR on some PDFs with JPX images in them.

2.6.0 Feature release Release date 2016-08-25 Download

New New CountMatchesTagger that will count occurrences of matching substring or regular expression in a field value or document content and store the count in a target field.
New DateFormatTagger now accepts multiple source formats when attempting to convert dates, trying them in order provided.
New DOMTagger can now apply DOM selection on an optional "fromField" and can also use a "defaultValue" when there is no match. #28
New New DOM selector possibilities for DOMContentFilter and DOMTagger: ownText, data, id, tagName, val, className, cssSelector, and attr(attributeKey).
New TranslatorSplitter now supports Yandex translation service.
New GenericDocumentParserFactory/AbstractTikaParser now allows you to control which embedded documents you do not want extracted from their containers.
New GenericDocumentParserFactory/AbstractTikaParser now allows you to control which document containers should not have their embedded documents extracted.
New GenericDocumentParserFactory/AbstractTikaParser now allows you to specify, via regular expression, the content types whose embedded documents should be "split".
New GenericDocumentParserFactory now allows you to define and configure parsers via XML.
New New IHintsAwareParser interface for parsers that can benefit from global configuration settings.
New New ParseHints class holding generic configuration settings to be set on parsers implementing the new IHintsAwareParser.
New New EmbeddedConfig class holding configuration settings related to embedded documents. Used by ParseHints on GenericDocumentParserFactory.
New Can now pass optional -e or --contentEncoding to command line to explicitly set the character encoding (charset).
Updated LanguageTagger now uses Tika language detection (supports at least 70 languages).
Updated GenericDocumentParserFactory has been modified to introduce the concept of ParseHints, which holds configuration settings that every parser has the option to support or not. Generic embedded and OCR configuration settings have been moved to the new ParseHints class.
Updated The following GenericDocumentParserFactory methods are now deprecated: setSplitEmbedded(boolean), isSplitEmbedded(), setOCRConfig(OCRConfig), and getOCRConfig().
Updated It is now possible to configure ExternalParser via XML.
Updated Now validates configuration and variable file paths when launched on the command line (throws errors on invalid paths).
Updated Dependency updates: Tika 1.13 (which now uses PDFBox 2.x), Norconex Commons Lang 1.9.1, JSoup 1.9.2.
Updated OCRConfig#setContentTypes(String) and the equivalent configuration option in GenericDocumentParserFactory now expect a regular expression as opposed to a comma-separated list of content types.
Updated DebugTagger now assumes UTF-8 instead of OS default charset when printing content.
Updated Subclasses of AbstractStringTagger will now see tagTextDocument(...) method invoked at least once even if there is no content supplied.
Fixed Fixed DOMTagger ignoring subsequent selectors when one selector has no match. #21
Fixed Fixed ContentTypeDetector not closing TikaInputStream properly resulting in temporary "apache-tika-XXX.tmp" files not being deleted properly.
Fixed Fixed infinite loop with DOMSplitter when some selectors are too generic.
Fixed AbstractCharStreamTagger now tolerates null content stream.
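
The DateFormatTagger entry above mentions trying multiple source formats in the order provided. A minimal sketch of that idea follows. It uses java.time (Java 8 or later) for brevity, which may differ from what the library uses internally; the class and method names are hypothetical, and date-only values are assumed.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of "try each source format in order".
class MultiFormatDateSketch {

    // Returns the value converted to toFormat using the first source
    // format that parses it, or null when none of them match.
    static String reformat(
            String value, List<String> fromFormats, String toFormat) {
        for (String fromFormat : fromFormats) {
            try {
                LocalDate date = LocalDate.parse(value,
                        DateTimeFormatter.ofPattern(fromFormat, Locale.ENGLISH));
                return date.format(
                        DateTimeFormatter.ofPattern(toFormat, Locale.ENGLISH));
            } catch (DateTimeParseException e) {
                // This format did not match; try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> formats = Arrays.asList("yyyy-MM-dd", "dd/MM/yyyy");
        System.out.println(reformat("25/08/2016", formats, "yyyy-MM-dd"));
    }
}
```

Because formats are tried in order, ambiguous inputs are resolved by whichever matching format comes first in the list.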

2.5.2 Maintenance release Release date 2016-05-31 Download

Updated It is now possible to specify a locale when parsing/formatting dates with CurrentDateTagger and DateFormatTagger.
Updated Dependency updates: PDFBox 2.0.0 (final release).

2.5.1 Bug fix release Release date 2016-03-22 Download

Updated Text-based transformers extending AbstractCharStreamTransformer now log a warning when the character encoding could not be detected, suggesting that you make sure the content being transformed is text.
Fixed StripBetweenTransformer now accepts multiple strip endpoints with the same "start" regex.

2.5.0 Feature release Release date 2016-02-28 Download

New DOMTagger and DOMFilter can now be told how to return matching element values (i.e., text, html, or outerHtml).
New New CharsetTagger to convert the character encoding of specified document metadata field into the desired target character encoding.
New New CharsetTransformer to convert the character encoding of a document content into the desired target character encoding.
New New CharsetUtil class offering simplified charset detection and conversion methods.
Updated The "log4j.properties" file has been moved from classes to the installation root directory.
Updated DOMTagger now returns matching element text as opposed to HTML (can be configured back to HTML).
Updated When used as pre-parse handlers, most handlers dealing with text now accept a charset to use for parsing content, or will detect the encoding when no charset is specified. This eliminates many bad character issues.
Updated Metadata document.contentEncoding is now always set when passed to importDocument method.
Updated Dependency updates: Apache Tika 1.12, Norconex Commons Lang 1.9.0, Jempbox 1.8.11 (still required by Tika JPegParser), PDFBox 2.0.0-RC3, Apache Commons CLI 1.3.1.
Updated Importer-specific version of Tika PDFParser was updated to work around PDFBox 2.0 no longer depending on Jempbox.
Updated Importer now logs exceptions sometimes thrown when importing fails as WARN instead of DEBUG.
Fixed Fixed invalid zip bomb detection on PDFs with elements nested more than 100 levels deep.
Fixed Fixed charset in HTML comments being wrongfully considered when charset is being detected.
Fixed Fixed NullPointerException being thrown with some PDFs when extracting multilingual items.
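
The character encoding conversion that CharsetTagger and CharsetTransformer are described as performing in this release amounts to decoding bytes with the source charset and re-encoding them with the target charset. The sketch below shows that round trip with the JDK only; the class and method names are hypothetical and this is not the library's implementation.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of converting content between charsets.
class CharsetConversionSketch {

    // Decodes the bytes using the source charset, then re-encodes the
    // resulting text using the target charset.
    static byte[] convert(byte[] content, Charset source, Charset target) {
        return new String(content, source).getBytes(target);
    }

    public static void main(String[] args) {
        // "caf\u00e9" ends with an accented e, written as an escape so
        // the example does not depend on the source file encoding.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = convert(
                latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(latin1.length + " bytes -> " + utf8.length + " bytes");
    }
}
```

The accented character takes one byte in ISO-8859-1 and two in UTF-8, which is why decoding with the wrong source charset produces the "bad character" symptoms mentioned above.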

2.4.0 Feature release Release date 2015-11-02 Download

New The following new handlers enable using scripting languages to define processing logic: ScriptFilter, ScriptTagger, and ScriptTransformer.
New New DOMContentFilter to filter out XML/HTML documents containing identified element or element value using a friendly syntax to navigate a DOM-tree structure. #48
New New DOMSplitter handler to split XML/HTML documents into multiple documents based on a specified element.
New New DOMTagger handler to extract text elements from XML/HTML documents using a friendly syntax to navigate a DOM-tree structure.
New CharacterCaseTagger can now be applied to field names (in addition to, or instead of, values).
New New CommonRestrictions class to obtain restrictions commonly associated with certain documents.
New New methods on AbstractImporterHandler to deal with restrictions: #addRestriction(PropertyMatcher...), #addRestrictions(List), #removeRestriction(String), #getRestrictions(), #removeRestriction(PropertyMatcher), #clearRestrictions().
Updated New file formats supported (brought by Tika update): GCMD DIF, Geographic ISO 19139 files, CBOR.
Updated Dependency updates: Apache Tika 1.10, JSoup 1.8.3, Norconex Commons Lang 1.8.0.
Updated Importer ExternalParser now uses corrected ExternalParser from Tika.
Updated AbstractStringTransformer#transformStringContent(...) now throws an ImporterHandlerException.
Updated Saved and loaded configuration-related classes are now equal. Methods equals/hashCode/toString for those classes are now implemented uniformly and were added where missing.
Fixed Fixed some configuration classes not always being saved to XML properly or giving errors.

2.3.1 Maintenance release Release date 2015-08-07 Download

Updated Dependency updates: Norconex Commons Lang 1.7.0.

2.3.0 Feature release Release date 2015-07-21 Download

New New TextPatternTagger for extracting text matching regular expressions out of a document content and storing matches into a field. New unit tests created for it.
Updated Jar manifest now includes implementation entries and specifications entries (matching Maven pom.xml).
Updated Javadoc fixes and updates.
Updated Library updates: Norconex Commons Lang 1.6.2.
Fixed Fixed NullPointerException in DebugTagger when a field contains a null value.

2.2.0 Feature release Release date 2015-06-15 Download

New New DocumentLengthTagger for adding the document byte length as a field to imported documents.
New New CurrentDateTagger for adding the current date as a field to imported documents.
New New NumericMetadataFilter for filtering documents based on whether a numeric field value matches a given numeric range.
New New DateMetadataFilter for filtering documents based on whether a date field value matches a given date range.
New New ExternalParser class which is used to run an external process for parsing files (e.g. pdftotext) of the associated content type.
Updated By default, PDF parsing is now done with this flag set to true: "suppressDuplicateOverlappingText". This should eliminate the extraction of duplicate text in PDFs where bolding is done by having multiple instances of the same string on top of each other.
Updated Complete rewrite of AbstractStringFilter, AbstractStringTagger, and AbstractStringTransformer to limit the memory taken for loading the content. Now the memory is specified in absolute terms instead of dynamically allocating it based on free memory (an approach that could cause OutOfMemory errors). All subclasses now accept a "maxReadSize" configuration option to set the maximum number of characters to process at once. #9
Updated The abstract methods accepting a "partial" boolean argument on AbstractStringFilter, AbstractStringTagger, and AbstractStringTransformer have been changed to now accept a "sectionIndex" integer, representing the document content section being processed. Only larger documents will be processed one section of text at a time (to preserve memory).
Updated AbstractCharStreamTransformer#transformTextDocument(...) now throws an ImporterHandlerException instead of IOException to be consistent with other handlers.
Updated TitleGeneratorTagger was rewritten and no longer uses Carrot2, to reduce library dependencies.
Updated Removed custom Tika mappings for Microsoft Visio now that they have been added to default Tika mappings in Tika 1.8. Reference: https://issues.apache.org/jira/browse/TIKA-1286
Updated ReplaceTagger: now case insensitive by default. Added a new flag to turn case-sensitivity on/off. #addReplacement(...) methods have been deprecated in favor of addReplacement(Replacement).
Updated Regular expressions in RegexContentFilter, RegexMetadataFilter, ReplaceTagger, TextBetweenTagger, ReplaceTransformer, StripAfterTransformer, StripBeforeTransformer, and StripBetweenTransformer now always have the Pattern.DOTALL flag enabled. When case sensitivity is enabled for regex, Pattern.UNICODE_CASE is now always used.
Updated Library updates: Apache Tika 1.8, Norconex Commons Lang 1.6.1, Apache Commons CLI 1.3, Apache Jempbox 1.8.9, Jempbox 2.0.0. Removed these library "direct" dependencies: Carrot2 (3.9.4), Lucene Analyzers (5.0.0), and Stax2 API (3.1.4).
Updated Javadoc fixes and updates.
Updated New unit tests to cover all filter onMatch use cases.
Fixed Fixed filters not working properly when using onMatch="include". Affects all subclasses of AbstractDocumentFilter, which now details the include/exclude logic in its Javadoc (github collector-http#108).
Fixed Fixed "Too many open files" exception.
Fixed Fixed the "restrictTo" feature not always working for AbstractImporterHandler subclasses. #7
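
The regular expression flag change listed above for this release relies on two standard java.util.regex.Pattern flags. The sketch below shows their effect using the JDK only: DOTALL lets "." match line terminators, and UNICODE_CASE makes case-insensitive matching apply beyond US-ASCII (it only has an effect together with CASE_INSENSITIVE). The class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of Pattern.DOTALL and Pattern.UNICODE_CASE.
class RegexFlagsSketch {

    // Returns true when the whole text matches the regex under the flags.
    static boolean matches(String regex, String text, int flags) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String multiLine = "start\nend";
        // Without DOTALL, "." stops at the line break.
        System.out.println(matches("start.*end", multiLine, 0));
        System.out.println(matches("start.*end", multiLine, Pattern.DOTALL));

        // Accented letters written as escapes: lower and upper case "ete".
        String lower = "\u00e9t\u00e9";
        String upper = "\u00c9T\u00c9";
        // CASE_INSENSITIVE alone only folds US-ASCII letters.
        System.out.println(matches(lower, upper, Pattern.CASE_INSENSITIVE));
        System.out.println(matches(lower, upper,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));
    }
}
```

In each pair, the first call does not match and the second does, which is the practical difference users of the affected handlers would notice on multi-line or non-ASCII content.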

2.1.1 Maintenance release Release date 2015-04-08 Download

Updated PDFBox now uses latest snapshot (as opposed to a frozen one).
Updated Javadoc fixes.
Updated Library updates: SLF4J 1.7.12.

2.1.0 Feature release Release date 2015-03-31 Download

New Added OCR support using the Tesseract open-source product. Configured by setting an OCRConfig on GenericDocumentParserFactory.
New Added document translation support with the new TranslatorSplitter. Supports these translation APIs: Microsoft, Google, Lingo24, and Moses. The document content and/or chosen fields can be translated.
New New TitleGeneratorTagger to dynamically generate titles out of documents, using Carrot2 to extract the best terms.
New New EnhancedPDFParser and EnhancedPDF2XHTML classes modifying the original Tika PDFParser to add support for PDF XFA (dynamic forms) text extraction as well as adding support for PDFBox 2.0.0 (which fixes the stripping of space characters between words in many PDFs).
New New XFDLParser for parsing PureEdge Extensible Forms Description Language files (XFDL). Supports both Gzipped+Base64 and plain text versions.
New New WordPerfectParser class for parsing WordPerfect documents according to WordPerfect file specifications.
New New QuattroProParser class for parsing QuattroPro documents according to QuattroPro file specifications.
New New configuration "parseErrorsSaveDir" on importer configuration for saving files that caused parsing errors along with their exception and metadata if any.
New KeepOnlyTagger and DeleteTagger now support regular expressions for identifying fields to keep/delete. The field="" attribute has been replaced by a element.
New Added support for JBIG2 and jpeg2000 image formats.
Updated Improved content detection of MS Office and Corel Office documents when importing an input stream with no specified extension.
Updated Improved overall content detection accuracy and performance.
Updated Default allocated memory for caching of document content was increased by a factor of 10 (10MB max per document, 100MB max total).
Updated AbstractTikaParser can now be extended to modify Tika ParseContext.
Updated importer.bat and importer.sh will now load the log4j.properties from the ./classes folder.
Updated Now always flush output stream from parsers so implementors do not have to be concerned with this.
Updated Easier to extend GenericDocumentParserFactory to provide custom parsers. Dropped "registerNamedParser", "registerFallbackParser", and "getFallbackParser" methods in favor of new "createFallbackParser" and "createNamedParsers" methods.
Updated HTMLParser and PDFParser are now deprecated. HTML and PDF are now handled by the fall-back parser (auto-detected).
Updated IDocumentSplittableEmbeddedParser is now deprecated and has no effect. Will be deleted in a future release.
Updated Minor javadoc improvements and fixes.
Updated No longer adds null handlers (possible when configuration loading failed for a handler).
Updated Improved exception handling for configuration loading.
Updated Library updates: Tika 1.7, Norconex Commons Lang 1.6.0, JUnit 4.12, PDFBox 2.0.0 (SNAPSHOT-2015-03-28), Apache Commons Codec 1.10, Lucene Analyzer Common 5.0.0.
Updated Updated several maven plugins and added SonarQube maven plugin.
Updated Added Sonatype repository to pom.xml for snapshot releases.
Updated Added more unit tests for various content type parsing.
Fixed Fixed embedded objects not always having the right content-type.
Fixed Fixed invalid mapping between "application/wordperfect" content type and WordPerfectParser.
Fixed Fixed AbstractCharStreamTagger subclasses badly detecting character encoding and failing documents as a consequence.

2.0.0 Major release Release date 2014-11-25 Download

New Importing now returns an ImporterResponse, which may hold the imported document, along with nested documents, and an ImporterStatus.
New New IDocumentSplitter handler and related classes, allowing implementations to split documents into more documents.
New DefaultDocumentParserFactory can now be configured to treat embedded documents as distinct documents (committed separately). Parsers can now implement IDocumentSplittableEmbeddedParser to indicate they support document splitting.
New DefaultDocumentParserFactory can now ignore parsing specified content-types.
New New IImporterResponseProcessor to process the import response.
New Document encoding can now be explicitly specified when importing, and the value gets stored as a metadata field.
New New ContentTypeDetector for detecting the content-type from documents.
New New ImporterDocument, holding all objects related to a document being imported.
New New ImporterMetadata, extending Properties to provide additional import-related convenience methods and constants.
New New CsvSplitter class for splitting comma-separated value files into multiple records/documents to be indexed.
New New RegexContentFilter for accepting/rejecting documents based on a successful regular expression match on their content.
New New CharacterCaseTagger for modifying the character case of a metadata field value.
New New DateFormatTagger for parsing/formatting date from specified metadata fields.
New New DebugTagger for logging document content and/or metadata to help with implementation and troubleshooting.
New New LanguageTagger which analyzes a document content to automatically detect and store as metadata the document language.
New New TextStatisticsTagger that stores as metadata statistical information about a document content (word count, average words per sentence, etc.).
New New AbstractDocument* class for each type of handler, facilitating handler implementation.
New Directory where temporary files are created is now configurable.
New Added support for parsing .iso files.
Updated Now licensed under The Apache License, Version 2.0.
Updated Document content reads and writes are now performed in memory up to a configurable maximum size, after which the filesystem gets used. This reduces I/O and improves performance.
Updated Now every handler except filters can be restricted to matching metadata values (configurable).
Updated *.tagger, *.filter, and *.transformer handlers were moved to *.handler.tagger, *.handler.filter, and *.handler.transformer.
Updated com.norconex.importer.ContentType has been replaced with com.norconex.commons.lang.file.ContentType.
Updated For consistency, several references to metadata field names were renamed to use the term "field" (instead of "property" or other terms).
Updated DefaultDocumentParserFactory was renamed to GenericDocumentParserFactory.
Removed Handler "contentTypeRegex" tag was removed from handlers that supported it in favor of the more flexible "restrictTo" tag(s).

1.3.0 Feature release Release date 2014-08-18 Download

New Now stores the content "family" for each document as "importer.contentFamily". This is a higher-level representation of a file's content type.
New New SplitTagger: Split values into multiple-values using a separator of choice.
New New CopyTagger: copies document metadata fields to other fields.
New New HierarchyTagger: splits a field string into multiple segments representing each node of a hierarchical branch.
Updated Improved detection of certain mime types, such as those previously appearing as application/x-tika-*.
Updated ReplaceTagger now supports regular expressions (via a new "regex" flag).
Updated Can now detect these MS Visio mime-types properly: vsdx, vstc, vssx, vsdm, vstm, vssm.
Updated AbstractCharStreamTransformer now enforces streaming as UTF8.
Updated Now requires Java 7 or higher.
Fixed ReplaceTagger regular expression matching now only replaces matching "fromValue".

1.2.0 Feature release Release date 2014-03-09 Download

New Now extracts text from WordPerfect documents (new WordPerfectParser class).
New New transformer "ReduceConsecutivesTransformer" to reduce consecutive instances of the same string to only one instance.
New New transformer "ReplaceTransformer" to perform search and replace on document content using regular expression.
New New filter "EmptyMetadataFilter" to exclude/include documents with no data for one or more specified metadata properties.
Updated Library updates: Tika 1.5, Norconex Commons Lang 1.3.0.
Updated Now attempts to detect the character encoding of a character stream by looking at the Content-Type metadata. If none is present, defaults to UTF-8.
Fixed Fixed NPE in AbstractTextRestrictiveHandler when no content-type is found when used before parsing.
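
The character encoding detection described above for this release (look at the Content-Type metadata, otherwise default to UTF-8) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the class name, method name, and the regular expression used to find the charset parameter are all hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of charset detection from a Content-Type value.
class ContentTypeCharsetSketch {

    // Matches a charset parameter such as: charset=ISO-8859-1
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the Content-Type value, or UTF-8 when
    // the value is null, has no charset parameter, or names a charset
    // this JVM does not recognize.
    static Charset detect(String contentType) {
        if (contentType != null) {
            Matcher matcher = CHARSET.matcher(contentType);
            if (matcher.find()) {
                try {
                    return Charset.forName(matcher.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported name: fall back to the default.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(detect("text/html; charset=ISO-8859-1"));
        System.out.println(detect("text/plain"));
    }
}
```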

1.1.0 Minor release Release date 2013-08-20 Download

New New tagger "TextBetweenTagger" to extract strings from a document and store them into document metadata fields.
New New AbstractRestrictiveHandler and AbstractTextRestrictiveHandler abstract classes to facilitate re-use of common capabilities in handlers.
New New BufferUtil and MemoryUtil classes.
Updated AbstractRestrictiveTransformer now deprecated.
Updated Upgraded norconex-commons-lang to 1.1.0.
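
The idea behind the TextBetweenTagger introduced in this release, extracting the strings found between a start and an end marker, can be sketched with the JDK regex classes. The class and method names below are hypothetical, the markers are treated as literal text, and the real tagger's options are not represented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of extracting text between two markers.
class TextBetweenSketch {

    // Returns every substring found between a start and an end marker.
    // Markers are quoted so they match literally; the reluctant ".*?"
    // stops at the nearest end marker, and DOTALL allows matches to
    // span line breaks.
    static List<String> extract(String content, String start, String end) {
        Pattern pattern = Pattern.compile(
                Pattern.quote(start) + "(.*?)" + Pattern.quote(end),
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(content);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(
                extract("<b>one</b> and <b>two</b>", "<b>", "</b>"));
    }
}
```

Each extracted value would then be stored in a metadata field of the caller's choosing.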

1.0.1 Maintenance release Release date 2013-08-02 Download

Updated Upgraded Apache Tika from 1.3 to 1.4.
Updated Removed dependency on aspectjrt due to GPL licensing incompatibility. If you need .iso parsing, you can manually download it and add it to the classpath.

1.0.0 Open Source release Release date 2013-06-04 Download

New Starting with this release, Norconex Importer is open-source under GPL.