Norconex Importer

2.x Release Notes

Release History

Version	Date	Description
2.11.0	2021-10-18	Feature release
2.10.0	2019-12-22	Feature release
2.9.0	2018-06-17	Feature release
2.8.0	2017-11-26	Feature release
2.7.2	2017-05-26	Bugfix release
2.7.1	2017-05-25	Maintenance release
2.7.0	2017-04-26	Feature release
2.6.1	2016-12-14	Minor release
2.6.0	2016-08-25	Feature release
2.5.2	2016-05-31	Maintenance release
2.5.1	2016-03-22	Bug fix release
2.5.0	2016-02-28	Feature release
2.4.0	2015-11-02	Feature release
2.3.1	2015-08-07	Maintenance release
2.3.0	2015-07-21	Feature release
2.2.0	2015-06-15	Feature release
2.1.1	2015-04-08	Maintenance release
2.1.0	2015-03-31	Feature release
2.0.0	2014-11-25	Major release
1.3.0	2014-08-18	Feature release
1.2.0	2014-03-09	Feature release
1.1.0	2013-08-20	Minor release
1.0.1	2013-08-02	Maintenance release
1.0.0	2013-06-04	Open Source release

2.11.0 Feature release Release date 2021-10-18 Download

New	New NoContentTransformer ported from version 3.0.0.
Updated	TitleGeneratorTagger now defaults to the first 10,000 characters for analysis to improve performance on large files.
Updated	Transformers are now invoked at least once, even when a document has no content.
Fixed	ExternalTransformer was improved to avoid concurrency exceptions.

2.10.0 Feature release Release date 2019-12-22 Download

New	New FieldReportTagger for reporting the fields discovered during crawling to a file (with sample values).
New	HierarchyTagger now has a boolean "regex" attribute to specify whether the separator should match a regular expression.
New	RenameTagger now has a boolean "regex" attribute to specify whether the fromField and toField are a regular expression pattern and its replacement.
Updated	Maven dependency updates: Apache Tika 1.18, Norconex Commons Lang 1.15.1.
Updated	HierarchyTagger no longer keeps empty segments by default. A new "keepEmptySegments" attribute has been added for this.
Updated	OCR configuration now expects full path of Tesseract executable (as opposed to installation folder).
Fixed	Fixed HierarchyTagger not constructing paths properly.	#91
Fixed	Fixed ClassCastException when an IDocumentFilter does not implement IOnMatchFilter.
Fixed	Fixed LanguageTagger choosing main language as the one with lowest probability.	#82
Fixed	Upgraded pdfbox to 2.0.11 due to potential security issue.
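
To illustrate the HierarchyTagger entries above, here is a minimal Java sketch of splitting a value into cumulative hierarchical paths, with empty segments dropped unless requested. It is not the HierarchyTagger implementation: the class and method names (HierarchySketch, toPaths) and the "/" output separator are assumptions made for the example. Only the notions of a "regex" separator and of keeping empty segments come from the notes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class HierarchySketch {

    // Hypothetical helper (not part of Norconex Importer): splits "value" on
    // "separator" and returns one cumulative path per retained segment.
    static List<String> toPaths(String value, String separator,
            boolean regex, boolean keepEmptySegments) {
        // Quote the separator unless it is meant to be a regular expression.
        String sepPattern = regex ? separator : Pattern.quote(separator);
        List<String> paths = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean first = true;
        // A limit of -1 keeps trailing empty segments so the flag decides.
        for (String segment : value.split(sepPattern, -1)) {
            if (segment.isEmpty() && !keepEmptySegments) {
                continue;
            }
            if (!first) {
                current.append('/');
            }
            current.append(segment);
            paths.add(current.toString());
            first = false;
        }
        return paths;
    }

    public static void main(String[] args) {
        // The empty segment between the two slashes is skipped.
        System.out.println(
                toPaths("vegetable//potato/sweet", "/", false, false));
    }
}
```

With keepEmptySegments set to true, the same input would also produce a path for the empty segment, which is the behavior that was the default before this release.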

2.9.0 Feature release Release date 2018-06-17 Download

New	New PDFPageSplitter to split PDF pages, treating them as individual documents.
Updated	ImporterResponse and ImporterStatus now display nicely in the logs (toString implemented).
Updated	Maven dependency updates: Norconex Commons Lang 1.15.0.
Fixed	Fixed TitleGeneratorTagger throwing NullPointerException when "fromField" is specified but does not exist (is null).	#74
Fixed	Fixed "buffer underrun" exception sometimes appearing when parsing some .msg files with embedded files.	#72

2.8.0 Feature release Release date 2017-11-26 Download

New	New TruncateTagger class.
New	New ExternalTagger class.	#64
New	ExternalTransformer and ExternalParser can now supply/retrieve metadata as files to external applications and can also pass the document reference as argument. New command line tokens: ${INPUT_META} ${OUTPUT_META} ${REFERENCE}.	#63
New	New configuration option for DOMTagger, DOMSplitter and DOMContentFilter for specifying which parser to use ("html" or "xml").
New	TextPatternTagger can now extract field names in addition to field values.	#52
New	New RegexUtil and RegexFieldExtractor classes.
New	TextPatternTagger case sensitivity is now applied to individual patterns.
Updated	ReplaceTagger and ReplaceTransformer now support empty/null replacement values, resulting in replacing matches with nothing.
Updated	ExternalTransformer and ExternalParser can now specify regex match groups for field names and field values.
Updated	Now uses WordPerfect and Quattro Pro parsers contributed to Apache Tika.
Updated	Maven dependency updates: Apache Tika 1.16, Norconex Commons Lang 1.14.0.
Fixed	Fixed ExternalTransformer and ExternalParser having issues with arguments with spaces in them.	#64
Removed	Removed copies of Apache Tika classes that are now fixed in Apache Tika: ListTables, ImageParser, ListManager, PDF2XHTML, CharsetDetector.
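
The command-line tokens introduced in this release (${INPUT_META}, ${OUTPUT_META}, ${REFERENCE}) are placeholders in an external command. How the Importer expands them internally is not described here, so the following Java sketch only shows the general idea using plain string replacement. The class name, the "myapp" command with its options, and the file paths are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CommandTokenSketch {

    // Replaces each ${NAME} token found in the command template with the
    // matching value. Tokens without a supplied value are left untouched.
    static String expand(String command, Map<String, String> values) {
        String result = command;
        for (Map.Entry<String, String> e : values.entrySet()) {
            // String#replace performs literal (non-regex) replacement.
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        // Hypothetical temporary metadata files and document reference.
        values.put("INPUT_META", "/tmp/in.meta");
        values.put("OUTPUT_META", "/tmp/out.meta");
        values.put("REFERENCE", "http://example.com/doc.pdf");
        System.out.println(expand(
                "myapp --in ${INPUT_META} --out ${OUTPUT_META} --ref ${REFERENCE}",
                values));
    }
}
```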

2.7.2 Bugfix release Release date 2017-05-26 Download

Fixed	Fixed "caseSensitive" flag sometimes having no effect in DOMContentFilter, RegexContentFilter, RegexMetadataFilter, and RegexReferenceFilter.

2.7.1 Maintenance release Release date 2017-05-25 Download

Updated	ImporterConfig#saveToXML(...) now written with xml:space="preserve".
Updated	Maven dependency updates: Norconex Commons Lang 1.13.1.

2.7.0 Feature release Release date 2017-04-26 Download

New	Added Lua scripting support to ScriptFilter, ScriptTagger, and ScriptTransformer.
New	New ExternalTransformer for transforming documents and extracting metadata using an external application.
New	Added schema-based XML configuration validation, which can be triggered from the command prompt with this new flag: -k or --checkcfg
New	New RegexReferenceFilter for filtering documents based on matching references (e.g. URL).
New	New MergeTagger for combining multiple fields into one.
New	New SubstringTransformer for reducing content (e.g., truncating) to the substring between given begin and end indexes.
New	New UUIDTagger for adding a random universally unique identifier (UUID) to documents.
New	CharacterCaseTagger now supports "swap" and "string" to swap character case and capitalize beginning of a string, respectively.
New	New ConstantTagger#setOnConflict(...) method to specify if the constant should be added to existing values, replace them, or do nothing.
New	Now distributed with utility scripts.
Updated	Dependency updates: Apache Tika 1.14, Norconex Commons Lang 1.13.0, JSoup 1.10.2, OOXML-Schemas 1.3 (fixes some bad Visio parsing), Apache Commons Collections 3.2.2.
Updated	ExternalParser was rewritten. Now offers more metadata extraction options and environment variable support.
Updated	Modified Javadoc to include an XML usage example for all XML-configurable classes.
Updated	Dependent libraries for the JPEG 2000 and JBIG2 image formats are no longer distributed with this product because of licensing incompatibilities. To enable them, you will need the JAR files found at these locations: http://central.maven.org/maven2/com/github/jai-imageio/jai-imageio-jpeg2000/ http://central.maven.org/maven2/com/levigo/jbig2/levigo-jbig2-imageio/
Fixed	Fixed NoClassDefFoundError on some MS Visio files: com/microsoft/schemas/office/visio/x2012/main/ConnectsType
Fixed	Fixed NullPointerException from parsing some Word documents.	#41
Removed	Removed FixedHtmlEncodingDetector class in favor of the fixed version of HtmlEncodingDetector. https://issues.apache.org/jira/browse/TIKA-1837
Removed	Removed deprecated Importer HTMLParser and PDFParser classes.
Removed	Removed deprecated IDocumentSplittableEmbeddedParser interface.
Removed	Removed Importer EnhancedPDFParser and EnhancedPDF2XHTML in favor of upgraded TIKA PDFParser and PDF2XHTML versions.
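
For readers unfamiliar with the random UUIDs mentioned in the UUIDTagger entry above, the standard Java library can generate them as sketched below. Whether UUIDTagger relies on this exact call is an assumption, and the class name UuidSketch is hypothetical.

```java
import java.util.UUID;

public class UuidSketch {

    // Returns a randomly generated (version 4) UUID in its canonical
    // 36-character text form.
    static String newId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Prints a different value on every run.
        System.out.println(newId());
    }
}
```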

2.6.1 Minor release Release date 2016-12-14 Download

New	DOMTagger now supports a new flag called "matchBlanks" to extract elements that contain empty values or values made of white spaces only.	#39
New	ReplaceTagger now supports new flags: "wholeMatch" and "replaceAll".
Updated	The default value in DOMTagger can now be an empty string or a string made of white spaces.	#39
Updated	Dependency updates: Norconex Commons Lang 1.12.3, Joda Time 2.9.4, Apache HTTP Client 4.5.2, Apache HTTP Core 4.4.5, JJ2000 5.3, JAI ImageIO jpeg2000 1.3.1
Fixed	Fixed ReplaceTagger not adding replaced value to "toField" when it is the same as original value.	#29
Fixed	Fixed NoSuchMethodError when performing OCR on some PDFs with JPEG 2000 images in them.
Fixed	Fixed "No ImageWriter found for 'jpx' format" when performing OCR on some PDFs with JPX images in them.

2.6.0 Feature release Release date 2016-08-25 Download

New	New CountMatchesTagger that will count occurrences of matching substring or regular expression in a field value or document content and store the count in a target field.
New	DateFormatTagger now accepts multiple source formats when attempting to convert dates, trying them in the order provided.
New	DOMTagger can now apply DOM selection on an optional "fromField" and can also use a "defaultValue" when there is no match.	#28
New	New DOM selector possibility for DOMContentFilter and DOMTagger: ownText, data, id, tagName, val, className, cssSelector, and attr(attributeKey).
New	TranslatorSplitter now supports Yandex translation service.
New	GenericDocumentParserFactory/AbstractTikaParser now allows you to control which embedded documents you do not want extracted from their containers.
New	GenericDocumentParserFactory/AbstractTikaParser now allows you to control which document containers should not have their embedded documents extracted.
New	GenericDocumentParserFactory/AbstractTikaParser now allows you to specify which content types to "split" their embedded documents via regular expression.
New	GenericDocumentParserFactory now allows you to define and configure parsers via XML.
New	New IHintsAwareParser interface for parsers that can benefit from global configuration settings.
New	New ParseHints class holding generic configuration settings to be set on parsers implementing the new IHintsAwareParser.
New	New EmbeddedConfig class holding configuration settings related to embedded documents. Used by ParseHints on GenericDocumentParserFactory.
New	Can now pass optional -e or --contentEncoding to command line to explicitly set the character encoding (charset).
Updated	LanguageTagger now uses Tika language detection (supports at least 70 languages).
Updated	GenericDocumentParserFactory has been modified to introduce the concept of ParseHints, which holds configuration settings every parser has the option to support or not. Generic embedded and OCR configuration settings have been moved to the new ParseHints class.
Updated	The following GenericDocumentParserFactory methods are now deprecated: setSplitEmbedded(boolean), isSplitEmbedded(), setOCRConfig(OCRConfig), and getOCRConfig().
Updated	It is now possible to configure ExternalParser via XML.
Updated	Now validates configuration and variable file paths when launched on the command line (throws errors on invalid paths).
Updated	Dependency updates: Tika 1.13 (which now uses PDFBox 2.x), Norconex Commons Lang 1.9.1, JSoup 1.9.2.
Updated	OCRConfig#setContentTypes(String) and equivalent configuration option in GenericDocumentParserFactory now expects a regular expression as opposed to a comma-separated list of content types.
Updated	DebugTagger now assumes UTF-8 instead of OS default charset when printing content.
Updated	Subclasses of AbstractStringTagger will now see tagTextDocument(...) method invoked at least once even if there is no content supplied.
Fixed	Fixed DOMTagger ignoring subsequent selectors when one selector has no match.	#21
Fixed	Fixed ContentTypeDetector not closing TikaInputStream properly resulting in temporary "apache-tika-XXX.tmp" files not being deleted properly.
Fixed	Fixed infinite loop with DOMSplitter when some selectors are too generic.
Fixed	AbstractCharStreamTagger now tolerates null content stream.
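
The CountMatchesTagger entry above describes counting occurrences of a substring or regular expression. A minimal Java sketch of that idea follows. It assumes non-overlapping matches and is not the tagger's actual code, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CountMatchesSketch {

    // Counts non-overlapping occurrences of "expression" in "text".
    // When "regex" is false, the expression is matched literally.
    static int countMatches(String text, String expression, boolean regex) {
        Pattern pattern = regex
                ? Pattern.compile(expression)
                : Pattern.compile(Pattern.quote(expression));
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String reference = "/docs/2016/08/release.html";
        // Literal count of path separators.
        System.out.println(countMatches(reference, "/", false));
        // Regular expression count of digit groups.
        System.out.println(countMatches(reference, "\\d+", true));
    }
}
```

Counting slashes in a URL path this way is one example of deriving a "depth" value that could then be stored in a target field.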

2.5.2 Maintenance release Release date 2016-05-31 Download

Updated	It is now possible to specify a locale when parsing/formatting dates with CurrentDateTagger and DateFormatTagger.
Updated	Dependency updates: PDFBox 2.0.0 (final release).
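
To show why a locale matters when parsing dates, as in the CurrentDateTagger and DateFormatTagger entry above, the following Java sketch parses a date containing a month name written in a given language. It uses the standard java.time API rather than the taggers themselves, and the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class LocaleDateSketch {

    // Parses a date whose text (e.g., month name) is written in the given
    // locale and formats it as an ISO-8601 local date (yyyy-MM-dd).
    static String toIso(String date, String pattern, Locale locale) {
        DateTimeFormatter parser = DateTimeFormatter.ofPattern(pattern, locale);
        return LocalDate.parse(date, parser)
                .format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    public static void main(String[] args) {
        // The same pattern needs a different locale per input language.
        System.out.println(toIso("31 mai 2016", "d MMMM yyyy", Locale.FRENCH));
        System.out.println(toIso("31 May 2016", "d MMMM yyyy", Locale.ENGLISH));
    }
}
```

Parsing the French input with an English locale would fail with a DateTimeParseException, which is the kind of problem a configurable locale avoids.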

2.5.1 Bug fix release Release date 2016-03-22 Download

Updated	Text-based transformers extending AbstractCharStreamTransformer now log a warning when character encoding could not be detected, suggesting to make sure the content being transformed is text.
Fixed	StripBetweenTransformer now accepts multiple strip endpoints with the same "start" regex.

2.5.0 Feature release Release date 2016-02-28 Download

New	DOMTagger and DOMFilter can now be told how to return matching element values (i.e., text, html, or outerHtml).
New	New CharsetTagger to convert the character encoding of specified document metadata field into the desired target character encoding.
New	New CharsetTransformer to convert the character encoding of a document content into the desired target character encoding.
New	New CharsetUtil class offering simplified charset detection and conversion methods.
Updated	The "log4j.properties" file has been moved from classes to the installation root directory.
Updated	DOMTagger now returns matching element text as opposed to HTML (can be configured back to HTML).
Updated	When used as pre-parse handlers, most handlers dealing with text now accept a charset to use for parsing content, or will detect encoding when no charset is specified. This eliminates many bad character issues.
Updated	Metadata document.contentEncoding is now always set when passed to importDocument method.
Updated	Dependency updates: Apache Tika 1.12, Norconex Commons Lang 1.9.0, Jempbox 1.8.11 (still required by Tika JPegParser), PDFBox 2.0.0-RC3, Apache Commons CLI 1.3.1.
Updated	Importer-specific version of Tika PDFParser was updated to work around PDFBox 2.0 no longer depending on Jempbox.
Updated	Importer now logs a WARN instead of a DEBUG message for exceptions sometimes thrown when importing fails.
Fixed	Fixed invalid zip bomb detection on PDF with elements nested more than 100 levels deep.
Fixed	Fixed charset in HTML comments being wrongfully considered when charset is being detected.
Fixed	Fixed NullPointerException being thrown with some PDFs when extracting multilingual items.
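
The CharsetTagger and CharsetTransformer entries above both deal with converting text from one character encoding to another. In plain Java, such a conversion amounts to decoding with the source charset and re-encoding with the target one, as in this sketch. The class and method names are hypothetical, and the new CharsetUtil class may well work differently.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetConvertSketch {

    // Decodes the bytes using the source charset, then encodes the
    // resulting text using the target charset.
    static byte[] convert(byte[] content, Charset source, Charset target) {
        return new String(content, source).getBytes(target);
    }

    public static void main(String[] args) {
        // 0xE9 is "e with acute accent" in ISO-8859-1 (a single byte).
        byte[] latin1 = {(byte) 0xE9};
        byte[] utf8 = convert(latin1,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        // The same character takes two bytes in UTF-8 (0xC3 0xA9).
        System.out.println(utf8.length);
    }
}
```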

2.4.0 Feature release Release date 2015-11-02 Download

New	The following new handlers enable using scripting languages to define processing logic: ScriptFilter, ScriptTagger, and ScriptTransformer.
New	New DOMContentFilter to filter out XML/HTML documents containing identified element or element value using a friendly syntax to navigate a DOM-tree structure.	#48
New	New DOMSplitter handler to split XML/HTML documents into multiple documents based on a specified element.
New	New DOMTagger handler to extract text elements from XML/HTML documents using a friendly syntax to navigate a DOM-tree structure.
New	CharacterCaseTagger can now be applied to field names (in addition to, or instead of, values).
New	New CommonRestrictions class to obtain restrictions commonly associated with certain documents.
New	New methods on AbstractImporterHandler to deal with restrictions: #addRestriction(PropertyMatcher...), #addRestrictions(List), #removeRestriction(String), #getRestrictions(), #removeRestriction(PropertyMatcher), #clearRestrictions()
Updated	New file formats supported (brought by Tika update): GCMD DIF, Geographic ISO 19139 files, CBOR.
Updated	Dependency updates: Apache Tika 1.10, JSoup 1.8.3, Norconex Commons Lang 1.8.0.
Updated	Importer ExternalParser now uses corrected ExternalParser from Tika.
Updated	AbstractStringTransformer#transformStringContent(...) now throws an ImporterHandlerException.
Updated	Saved and loaded configuration-related classes are now equal. Methods equals/hashCode/toString for those classes are now implemented uniformly and were added where missing.
Fixed	Fixed some configuration classes not always being saved to XML properly or giving errors.

2.3.1 Maintenance release Release date 2015-08-07 Download

Updated	Dependency updates: Norconex Commons Lang 1.7.0.

2.3.0 Feature release Release date 2015-07-21 Download

New	New TextPatternTagger for extracting text matching regular expressions out of a document content and storing matches into a field. New unit tests created for it.
Updated	Jar manifest now includes implementation entries and specifications entries (matching Maven pom.xml).
Updated	Javadoc fixes and updates.
Updated	Library updates: Norconex Commons Lang 1.6.2.
Fixed	Fixed NullPointerException in DebugTagger when a field contains a null value.

2.2.0 Feature release Release date 2015-06-15 Download

New	New DocumentLengthTagger for adding the document byte length as a field to imported documents.
New	New CurrentDateTagger for adding the current date as a field to imported documents.
New	New NumericMetadataFilter for filtering documents based on whether a numeric field value matches a given numeric range.
New	New DateMetadataFilter for filtering documents based on whether a date field value matches a given date range.
New	New ExternalParser class which is used to run an external process for parsing files (e.g. pdftotext) of the associated content type.
Updated	By default PDF parsing is now done with this flag set to true: "suppressDuplicateOverlappingText". This should eliminate the extraction of duplicate text in PDF where bolding is done by having multiple instances of the same string on top of each other.
Updated	Complete rewrite of AbstractStringFilter, AbstractStringTagger, and AbstractStringTransformer to limit the memory taken for loading the content. Now the memory is specified in absolute terms instead of dynamically allocating it based on free memory (an approach that could cause OutOfMemory errors). All subclasses now accept a "maxReadSize" configuration option to set the maximum number of characters to process at once.	#9
Updated	The abstract methods accepting a "partial" boolean argument on AbstractStringFilter, AbstractStringTagger, and AbstractStringTransformer have been changed to now accept a "sectionIndex" integer, representing the document content section being processed. Only larger documents will be processed one section of text at a time (to preserve memory).
Updated	AbstractCharStreamTransformer#transformTextDocument(...) now throws an ImporterHandlerException instead of IOException to be consistent with other handlers.
Updated	TitleGeneratorTagger was rewritten and no longer uses Carrot2, to reduce library dependencies.
Updated	Removed custom Tika mappings for Microsoft Visio now that they have been added to default Tika mappings in Tika 1.8. Reference: https://issues.apache.org/jira/browse/TIKA-1286
Updated	ReplaceTagger: now case insensitive by default. Added a new flag to turn case-sensitivity on/off. #addReplacement(...) methods have been deprecated in favor of addReplacement(Replacement).
Updated	Regular expressions in RegexContentFilter, RegexMetadataFilter, ReplaceTagger, TextBetweenTagger, ReplaceTransformer, StripAfterTransformer, StripBeforeTransformer, and StripBetweenTransformer now always have the Pattern.DOTALL flag enabled, and when case sensitivity is enabled for regex, Pattern.UNICODE_CASE is now always used.
Updated	Library updates: Apache Tika 1.8, Norconex Commons Lang 1.6.1, Apache Commons CLI 1.3, Apache Jempbox 1.8.9, Jempbox 2.0.0. Removed these library "direct" dependencies: Carrot2 (3.9.4), Lucene Analyzers (5.0.0), and Stax2 API (3.1.4).
Updated	Javadoc fixes and updates.
Updated	New unit tests to cover all filter onMatch use cases.
Fixed	Fixed filters not working properly when using onMatch="include". Affects all subclasses of AbstractDocumentFilter, which now details the include/exclude logic in its Javadoc (github collector-http#108).
Fixed	Fixed "Too many open files" exception.
Fixed	Fixed the "restrictTo" feature not always working for AbstractImporterHandler subclasses.	#7
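
The regular expression entry above refers to two java.util.regex.Pattern flags. The sketch below shows their effect in isolation. It assumes UNICODE_CASE is combined with CASE_INSENSITIVE, since in java.util.regex the former only has an effect together with the latter. The class name and the compile helper are hypothetical and do not reflect how the listed handlers build their patterns.

```java
import java.util.regex.Pattern;

public class RegexFlagsSketch {

    // Builds a pattern where "." also matches line terminators (DOTALL).
    // When caseSensitive is false, case is ignored for non-ASCII letters
    // too (CASE_INSENSITIVE combined with UNICODE_CASE).
    static Pattern compile(String regex, boolean caseSensitive) {
        int flags = Pattern.DOTALL;
        if (!caseSensitive) {
            flags |= Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        }
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        // Three lines; the middle one is an upper-case word with accents.
        String text = "start\n\u00C9T\u00C9\nend";
        // With DOTALL, ".*" spans the line breaks.
        System.out.println(compile("start.*end", true).matcher(text).matches());
        // Without DOTALL, "." stops at the first line break.
        System.out.println(Pattern.compile("start.*end").matcher(text).matches());
        // The lower-case accented word matches its upper-case form.
        System.out.println(compile("\u00E9t\u00E9", false).matcher(text).find());
    }
}
```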

2.1.1 Maintenance release Release date 2015-04-08 Download

Updated	PDFBox now uses latest snapshot (as opposed to a frozen one).
Updated	Javadoc fixes.
Updated	Library updates: SLF4J 1.7.12.

2.1.0 Feature release Release date 2015-03-31 Download

New	Added OCR support using Tesseract open-source product. Configured by setting an OCRConfig to GenericDocumentParserFactory.
New	Added document translation support with the new TranslatorSplitter. Supports these translation APIs: Microsoft, Google, Lingo24, and Moses. The document content and/or chosen fields can be translated.
New	New TitleGeneratorTagger to dynamically generate titles out of documents, using Carrot2 to extract the best terms.
New	New EnhancedPDFParser and EnhancedPDF2XHTML classes modifying original Tika PDFParser to add support for PDF XFA (dynamic forms) text extraction as well as adding support for PDFBox 2.0.0 (which fixes the stripping of space characters between words in many PDFs).
New	New XFDLParser for parsing PureEdge Extensible Forms Description Language files (XFDL). Supports both Gzipped+Base64 and plain text versions.
New	New WordPerfectParser class for parsing WordPerfect documents according to WordPerfect file specifications.
New	New QuattroProParser class for parsing QuattroPro documents according to QuattroPro file specifications.
New	New configuration "parseErrorsSaveDir" on importer configuration for saving files that caused parsing errors along with their exception and metadata if any.
New	KeepOnlyTagger and DeleteTagger now support regular expressions for identifying fields to keep/delete. The field="" attribute has been replaced by a element.
New	Added support for JBIG2 and jpeg2000 image formats.
Updated	Improved content detection of MS Office and Corel Office documents when importing an input stream with no specified extension.
Updated	Improved overall content detection accuracy and performance.
Updated	Default allocated memory for caching of document content was increased by a factor of 10 (10MB max per document, 100MB max total).
Updated	AbstractTikaParser can now be extended to modify Tika ParseContext.
Updated	importer.bat and importer.sh will now load the log4j.properties from the ./classes folder.
Updated	Now always flush output stream from parsers so implementors do not have to be concerned with this.
Updated	Easier to extend GenericDocumentParserFactory to provide custom parsers. Dropped "registerNamedParser", "registerFallbackParser", and "getFallbackParser" methods in favor of new "createFallbackParser" and "createNamedParsers" methods.
Updated	HTMLParser and PDFParser are now deprecated. HTML and PDF are now handled by the fall-back parser (auto-detected).
Updated	IDocumentSplittableEmbeddedParser is now deprecated and has no effect. Will be deleted in a future release.
Updated	Minor javadoc improvements and fixes.
Updated	No longer adds null handlers (possible when configuration loading failed for a handler).
Updated	Improved exception handling for configuration loading.
Updated	Library updates: Tika 1.7, Norconex Commons Lang 1.6.0, JUnit 4.12, PDFBox 2.0.0 (SNAPSHOT-2015-03-28), Apache Commons Codec 1.10, Lucene Analyzer Common 5.0.0.
Updated	Updated several maven plugins and added SonarQube maven plugin.
Updated	Added Sonatype repository to pom.xml for snapshot releases.
Updated	Added more unit tests for various content type parsing.
Fixed	Fixed embedded objects not always having the right content-type.
Fixed	Fixed invalid mapping between "application/wordperfect" content type and WordPerfectParser.
Fixed	Fixed AbstractCharStreamTagger subclasses badly detecting character encoding and failing documents as a consequence.

2.0.0 Major release Release date 2014-11-25 Download

New	Importing now returns an ImporterResponse, which may hold the imported document, along with nested documents, and an ImporterStatus.
New	New IDocumentSplitter handler and related classes, allowing implementations to split documents into more documents.
New	DefaultDocumentParserFactory can now be configured to treat embedded documents as distinct documents (committed separately). Parsers can now implement IDocumentSplittableEmbeddedParser to indicate they are supporting document splitting.
New	DefaultDocumentParserFactory can now ignore parsing specified content-types.
New	New IImporterResponseProcessor to process the import response.
New	Document encoding can now be explicitly specified when importing and the value gets stored as a metadata field.
New	New ContentTypeDetector for detecting the content-type from documents.
New	New ImporterDocument, holding all objects related to a document being imported.
New	New ImporterMetadata, extending Properties to provide additional import-related convenience methods and constants.
New	New CsvSplitter class for splitting comma-separated value files into multiple records/documents to be indexed.
New	New RegexContentFilter for accepting/rejecting documents based on a successful regular expression match on their content.
New	New CharacterCaseTagger for modifying the character case of a metadata field value.
New	New DateFormatTagger for parsing/formatting date from specified metadata fields.
New	New DebugTagger for logging document content and/or metadata to help with implementation and troubleshooting.
New	New LanguageTagger which analyzes a document content to automatically detect and store as metadata the document language.
New	New TextStatisticsTagger that stores as metadata statistical information about a document content (word count, average words per sentence, etc.).
New	New AbstractDocument* class for each type of handler, facilitating handler implementation.
New	Directory where temporary files are created is now configurable.
New	Added support for parsing .iso files.
Updated	Now licensed under The Apache License, Version 2.0.
Updated	Document content reads and writes are now performed in memory up to a configurable maximum size, after which the filesystem gets used. This reduces I/O and improves performance.
Updated	All handlers except filters can now be restricted to matching metadata values (configurable).
Updated	.tagger, .filter, and .transformer handlers were moved to .handler.tagger, .handler.filter, and .handler.transformer.
Updated	com.norconex.importer.ContentType has been replaced with com.norconex.commons.lang.file.ContentType.
Updated	For consistency, several references to metadata field names were renamed to use the term "field" (instead of "property" or other terms).
Updated	DefaultDocumentParserFactory was renamed to GenericDocumentParserFactory.
Removed	Handler "contentTypeRegex" tag was removed from handlers that supported it in favor of the more flexible "restrictTo" tag(s).
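
The 2.0.0 notes above mention that document content is kept in memory up to a configurable maximum size, after which the filesystem is used. The following Java sketch illustrates that general technique with an output stream that spills to a temporary file once a threshold is exceeded. It is an independent illustration, not the Importer's own caching code; the class name, the threshold parameter, and the temporary file naming are all assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SpillingOutputStream extends OutputStream {

    private final int maxMemoryBytes;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream fileOut; // non-null once content moved to disk
    private Path file;

    public SpillingOutputStream(int maxMemoryBytes) {
        this.maxMemoryBytes = maxMemoryBytes;
    }

    @Override
    public void write(int b) throws IOException {
        // Move everything to disk the first time the limit would be passed.
        if (fileOut == null && memory.size() + 1 > maxMemoryBytes) {
            spill();
        }
        if (fileOut != null) {
            fileOut.write(b);
        } else {
            memory.write(b);
        }
    }

    // Copies the in-memory bytes to a new temporary file and frees them.
    private void spill() throws IOException {
        file = Files.createTempFile("content-", ".tmp");
        fileOut = Files.newOutputStream(file);
        memory.writeTo(fileOut);
        memory = null;
    }

    public boolean isInMemory() {
        return fileOut == null;
    }

    // Returns the temporary file, or null if content is still in memory.
    public Path getFile() {
        return file;
    }

    @Override
    public void close() throws IOException {
        if (fileOut != null) {
            fileOut.close();
        }
    }

    public static void main(String[] args) throws IOException {
        SpillingOutputStream out = new SpillingOutputStream(10);
        out.write(new byte[11]); // one byte over the in-memory limit
        out.close();
        System.out.println("in memory: " + out.isInMemory());
        Files.delete(out.getFile()); // clean up the temporary file
    }
}
```

Small documents never touch the disk with this approach, which is where the reduced I/O comes from, while large documents cannot exhaust the heap.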

1.3.0 Feature release Release date 2014-08-18 Download

New	Now stores the content "family" of each document as "importer.contentFamily". This is a higher-level representation of a file's content type.
New	New SplitTagger: Split values into multiple-values using a separator of choice.
New	New CopyTagger: copies document metadata fields to other fields.
New	New HierarchyTagger: splits a field string into multiple segments representing each node of a hierarchical branch.
Updated	Improved detection of certain mime types, such as those previously appearing as application/x-tika-*.
Updated	ReplaceTagger now supports regular expressions (via a new "regex" flag).
Updated	Can now detect these MS Visio mime-types properly: vsdx, vstc, vssx, vsdm, vstm, vssm.
Updated	AbstractCharStreamTransformer now enforces streaming as UTF8.
Updated	Now requires Java 7 or higher.
Fixed	ReplaceTagger regular expression matching now only replaces matching "fromValue".

1.2.0 Feature release Release date 2014-03-09 Download

New	Now extracts text from WordPerfect documents (new WordPerfectParser class).
New	New transformer "ReduceConsecutivesTransformer" to reduce consecutive instances of the same string to only one instance.
New	New transformer "ReplaceTransformer" to perform search and replace on document content using regular expression.
New	New filter "EmptyMetadataFilter" to exclude/include documents with no data for one or more specified metadata properties.
Updated	Library updates: Tika 1.5, Norconex Commons Lang 1.3.0.
Updated	Now attempts to detect the character encoding from a character stream by looking at a Content-Type metadata. If none is present, defaults to UTF-8.
Fixed	Fixed NPE in AbstractTextRestrictiveHandler when no content-type is found when used before parsing.
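
The ReduceConsecutivesTransformer entry above describes reducing consecutive instances of the same string to a single instance. One way to express that in Java is with a quantified, quoted pattern, as sketched below. This is an illustration of the idea under hypothetical names, not the transformer's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReduceConsecutivesSketch {

    // Collapses two or more back-to-back copies of "value" into one copy.
    // The value is quoted so it is always treated literally.
    static String reduce(String text, String value) {
        String quoted = Pattern.quote(value);
        return text.replaceAll("(?:" + quoted + "){2,}",
                Matcher.quoteReplacement(value));
    }

    public static void main(String[] args) {
        // Runs of spaces become a single space.
        System.out.println(reduce("a  b    c", " "));
        // Works for multi-character strings as well.
        System.out.println(reduce("hahaha!", "ha"));
    }
}
```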

1.1.0 Minor release Release date 2013-08-20 Download

New	New tagger "TextBetweenTagger" to extract strings from a document and store them into document meta data fields.
New	New AbstractRestrictiveHandler and AbstractTextRestrictiveHandler abstract classes to facilitate re-use of common capabilities in handlers.
New	New BufferUtil and MemoryUtil classes.
Updated	AbstractRestrictiveTransformer now deprecated.
Updated	Upgraded norconex-commons-lang to 1.1.0.

1.0.1 Maintenance release Release date 2013-08-02 Download

Updated	Upgraded Apache Tika from 1.3 to 1.4.
Updated	Removed dependency on aspectjrt due to GPL licensing incompatibility. If you need .iso parsing, you can manually download it and add it to the classpath.

1.0.0 Open Source release Release date 2013-06-04 Download

New	Starting with this release, Norconex Importer is open-source under GPL.