Norconex File System Crawler

2.x Release Notes

Release History

Version          Date         Description
2.9.2-SNAPSHOT   2023-07-09   Bug fix release
2.9.1            2021-10-18   Bug fix release
2.9.0            2019-12-22   Feature release
2.8.0            2017-11-26   Feature release
2.7.1            2017-05-26   Maintenance release
2.7.0            2017-04-26   Feature release
2.6.1            2016-12-14   Maintenance release
2.6.0            2016-08-25   Feature release
2.5.0            2016-06-03   Minor release
2.4.0            2016-02-28   Minor release
2.3.0            2015-11-06   Feature release
2.2.0            2015-07-22   Feature release
2.1.0            2015-04-08   Feature release
2.0.2            2015-02-04   Bug fix release
2.0.1            2014-12-03   Bug fix release
2.0.0            2014-11-26   Major release
1.0.0            2014-08-25   Initial release

2.9.2-SNAPSHOT Bug fix release Release date 2023-07-09

This release is currently in development and the following information may change.
Fixed File names with extended UTF-8 characters are now read properly. Non-ASCII UTF-8 characters are no longer escaped when read; only a handful of ASCII characters that could cause issues when converting a file name to a URI are escaped. #65
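
To illustrate the escaping behavior described in this fix, here is a minimal Java sketch that percent-encodes only a small set of ASCII characters and leaves non-ASCII characters untouched. The class name, the method name, and the exact set of escaped characters are assumptions made for this example; these notes do not list the characters the crawler actually escapes.

```java
// Hypothetical helper for illustration; not part of the Norconex API.
class FileNameEscapeSketch {

    // Assumed set of ASCII characters to escape (the real list may differ).
    private static final String TO_ESCAPE = " #%?[]{}^`|\"<>\\";

    static String escape(String fileName) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fileName.length(); i++) {
            char c = fileName.charAt(i);
            if (TO_ESCAPE.indexOf(c) >= 0) {
                // Percent-encode the ASCII character as two uppercase hex digits.
                out.append('%').append(String.format("%02X", (int) c));
            } else {
                // Everything else, including non-ASCII characters, is kept as-is.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escape sequences in the literal stand for an accented letter e.
        System.out.println(escape("r\u00e9sum\u00e9 #1.txt"));
    }
}
```

In this sketch the space and the pound sign are encoded as %20 and %23, while the accented letters pass through unchanged.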

2.9.1 Bug fix release Release date 2021-10-18

Fixed Fixed exception when retrieving ACL on a Windows local file system when the drive letter differs from that of the crawler's current directory. #54
Fixed Fixed invalid URI escape sequence error when dealing with local paths having URI-invalid characters in them.

2.9.0 Feature release Release date 2019-12-22

New Now extracts ACL from local files.
New From Collector Core update, added "unmanaged" attribute to "logsDir" configuration option to prevent the collector from managing its own file-based logging.
New Now supports CMIS (Atom), the open standard for content management systems such as Alfresco, Interwoven, Magnolia, SharePoint Server, OpenCMS, and OpenText Documentum.
Updated Dependency updates: Norconex Collector Core 1.10.0, Norconex Commons Lang 1.15.1.
Fixed Fixed files with pound sign being ignored and/or having the pound sign URL-encoded. #47
Fixed Fixed NullPointerException under some conditions for FilesystemCrawlerConfig#saveToXML(...). #29

2.8.0 Feature release Release date 2017-11-26

New Several new features (new TruncateTagger, ExternalTagger, etc.) are included with this release, mainly through Norconex Collector Core and Norconex Importer dependency updates. Refer to related release notes for more details.
Updated Dependency updates: Norconex Collector Core 1.9.0, Norconex Commons Lang 1.14.0.

2.7.1 Maintenance release Release date 2017-05-26

Updated Dependency updates: Norconex Collector Core 1.9.0.

2.7.0 Feature release Release date 2017-04-26

New Added schema-based XML configuration validation, which can be triggered from the command line with this new flag: -k or --checkcfg
New New configurable GenericFilesystemOptionsProvider, which allows configuring how different file systems are accessed (authentication, FTP(S), HTTP, WebDAV, etc.). A custom implementation can be provided with IFilesystemOptionsProvider.
New ACL is now extracted from SMB/CIFS file systems.
New Custom metadata extraction is now possible via IFileMetadataFetcher. Default implementation is GenericFileMetadataFetcher.
New Custom document extraction is now possible via IFileDocumentFetcher. Default implementation is GenericFileDocumentFetcher.
New Can now provide start paths dynamically with new IStartPathsProvider.
New New features from dependency updates. Collector Core: ICollectorLifeCycleListener. Importer: MergeTagger, ExternalTransformer.
New MongoCrawlDataStoreFactory now accepts encrypted passwords.
New Now distributed with utility scripts.
Updated XML configuration entries expecting millisecond durations can now be provided in human-readable format (e.g., "5 minutes and 30 seconds" or "5m30s").
Updated Dependency updates: Norconex Collector Core 1.8.0, Norconex Commons Lang 1.13.0, JCIFS 1.3.17, Apache Commons VFS Sandbox 2.1.
Updated Crawler events REJECTED_FILTER, REJECTED_BAD_STATUS, REJECTED_IMPORT, and REJECTED_ERROR are now DEBUG in log4j.properties.
Updated FilesystemCollectorException now deprecated in favor of CollectorException.
Updated Modified Javadoc to include an XML usage example for all XML-configurable classes.
Fixed Fixed minor errors in writing IXMLConfigurable classes to XML.
Removed Removed JDBCCrawlDataStoreFactory deprecated since 1.5 (replaced since by BasicJDBCCrawlDataStoreFactory).
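
The human-readable duration format listed among the 2.7.0 updates can be pictured with the following Java sketch, which converts text such as "5 minutes and 30 seconds" or "5m30s" to milliseconds. This is not the Norconex parser: the class name, the supported unit names, and the handling of plain numbers are assumptions for illustration only.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for illustration; not the Norconex implementation.
class DurationSketch {

    // A number followed by a unit word or abbreviation, e.g. "5 minutes" or "30s".
    private static final Pattern PART = Pattern.compile("(\\d+)\\s*([a-z]+)");

    static long toMillis(String text) {
        String input = text.trim().toLowerCase(Locale.ROOT);
        // Assumption: a plain number is already a millisecond value.
        if (input.matches("\\d+")) {
            return Long.parseLong(input);
        }
        long total = 0;
        boolean found = false;
        Matcher m = PART.matcher(input);
        // Words without a leading number (such as "and") are simply skipped.
        while (m.find()) {
            found = true;
            total += Long.parseLong(m.group(1)) * unitMillis(m.group(2));
        }
        if (!found) {
            throw new IllegalArgumentException("Not a duration: " + text);
        }
        return total;
    }

    private static long unitMillis(String unit) {
        switch (unit) {
            case "ms": case "millisecond": case "milliseconds": return 1L;
            case "s": case "second": case "seconds": return 1_000L;
            case "m": case "minute": case "minutes": return 60_000L;
            case "h": case "hour": case "hours": return 3_600_000L;
            case "d": case "day": case "days": return 86_400_000L;
            default: throw new IllegalArgumentException("Unknown unit: " + unit);
        }
    }

    public static void main(String[] args) {
        System.out.println(toMillis("5 minutes and 30 seconds"));
        System.out.println(toMillis("5m30s"));
    }
}
```

Both example strings from the release note resolve to the same value in this sketch (330000 milliseconds).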

2.6.1 Maintenance release Release date 2016-12-14

Updated Dependency updates: Norconex Commons Lang 1.12.3, JJ2000 5.3, Norconex Collector Core 1.7.0, Apache HTTP Client 4.5.2, Apache HTTP Core 4.4.5, Apache Commons Codec 1.10, Apache Commons Net 3.5, Apache HttpClient 3.1.
Fixed Fixed FTP file system. Added third-party dependencies and FTP configuration required for FTP file system to work. #11

2.6.0 Feature release Release date 2016-08-25

Updated Dependency updates: Norconex Collector Core 1.6.0, Apache Commons VFS 2.1, Joda Time 2.9.4, JSoup 1.8.3, and Norconex Importer 2.6.0, which introduces new document parsing/manipulation features.

2.5.0 Minor release Release date 2016-06-03

Updated MVStore is now the default URL crawl store.
Updated Dependency updates: Norconex Collector Core 1.5.0.
Updated JDBCCrawlDataStoreFactory now deprecated in favor of BasicJDBCCrawlDataStoreFactory from Collector Core.

2.4.0 Minor release Release date 2016-02-28

New Now supports specifying relative paths in startPaths (for local file systems only).
Updated The "log4j.properties" file has been moved from classes to the installation root directory.
Updated Dependency updates: Norconex Collector Core 1.4.0, Joda Time 2.9.2.

2.3.0 Feature release Release date 2015-11-06

Updated Dependency updates: Norconex Collector Core 1.3.0 and Norconex Importer 2.4.0, which introduces many new features.

2.2.0 Feature release Release date 2015-07-22

New New CurrentDateTagger, DateMetadataFilter, NumericMetadataFilter, TextPatternTagger, GenericSpoiledReferenceStrategizer and more new features introduced by dependency upgrades.
New New FileMetadataChecksummer#setDisabled(boolean) method to disable this default metadata checksummer.
Updated Jar manifest now includes implementation entries and specifications entries (matching Maven pom.xml).
Updated Dependency updates: Norconex Collector Core 1.2.0.
Updated Improved/fixed Javadoc.

2.1.0 Feature release Release date 2015-04-08

New Several new features, updates and fixes were added by upgrading Norconex Collector Core (http://www.norconex.com/collectors/collector-core/) and Norconex Importer (http://www.norconex.com/collectors/importer/) dependencies. Those include support for OCR, translation, a title generator, new content type parsing, and more. Refer to dependency release notes for more details.
Updated Library updates: Norconex Collector Core 1.1.0, JUnit 4.12, Joda-Time 2.7.
Updated Added Sonatype repository to pom.xml for snapshot releases.
Updated Updated several Maven plugins and added SonarQube Maven plugin.
Fixed Fixed log4j log levels incorrectly ending with a semi-colon.

2.0.2 Bug fix release Release date 2015-02-04

Fixed Fixed the collector "stop" action having no effect.
Fixed Fixed crawl data wrongfully applied as metadata after the import phase.
Fixed Fixed incorrect deletion behavior for embedded orphan documents.
Updated Improved log4j.properties logging options for crawler events.
Updated Upgraded Norconex Collector Core dependency to 1.0.2.

2.0.1 Bug fix release Release date 2014-12-03

Fixed From collector-core-1.0.1: When keepDownloads is true, saved files and directories are now prefixed with "f." and "d." respectively to avoid collisions. #44
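
One way to picture why the prefixes avoid collisions: without them, a reference that is saved as a file (say "docs/report") and a directory with the same name (needed for "docs/report/summary.txt") would compete for the same local path. The Java sketch below prefixes directory segments with "d." and the final file segment with "f.". The class name, method name, and the way references are split into segments are assumptions; the actual Collector Core logic may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the Collector Core implementation.
class DownloadPathSketch {

    static String toLocalPath(String reference) {
        // Keep only non-empty path segments.
        List<String> parts = new ArrayList<>();
        for (String segment : reference.split("/")) {
            if (!segment.isEmpty()) {
                parts.add(segment);
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            boolean isFile = (i == parts.size() - 1);
            if (i > 0) {
                out.append('/');
            }
            // The last segment is treated as the file, the others as directories.
            out.append(isFile ? "f." : "d.").append(parts.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLocalPath("docs/report"));
        System.out.println(toLocalPath("docs/report/summary.txt"));
    }
}
```

In this sketch "report" is stored as "f.report" when it is a file and as "d.report" when it is a directory, so both can coexist under "d.docs".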

2.0.0 Major release Release date 2014-11-26

New Upgraded Norconex Importer to version 2.0.0, which brings many new features to Norconex Filesystem Collector, such as document content splitting, splitting of embedded documents into individual documents, and new taggers for language detection, changing character case, parsing and formatting dates, providing content statistics, and more. For a complete list of changes, see the Norconex Importer release notes at: http://www.norconex.com/product/importer/changes-report.html#a2.0.0
New Can now supply a "pathsFile" as part of the startPaths, acting as a seed list.
New New H2 database implementation for the reference database (crawl data store).
New Now keeps track of parent references (for embedded/split documents).
New New replaceable FileMetadataChecksummer which takes the document modified date and size to create a unique representation of a file.
New New IFileDocumentProcessor to manipulate crawled documents before and after the import module is invoked.
New New support for file filtering based on metadata.
New New support for document filtering.
New New ability to keep files fetched from a file system in a local location.
New New JMX/MBean support added on crawlers.
Updated Now licensed under The Apache License, Version 2.0.
Updated Replaced the configuration option "deleteOrphans(true|false)" with "orphansStrategy(DELETE|PROCESS|IGNORE)".
Updated The collector now references document content as reusable InputStream with memory caching instead of relying only on files. This saves a great deal of disk I/O and improves performance in most cases.
Updated Refactored to use the new Norconex Collector Core library. A significant portion of the Norconex Filesystem Collector code has been moved to that core library.
Updated New and more scalable crawler event model along with new listeners.
Updated Refactored to use JEF 4.0.0, which makes the Filesystem Collector easier to monitor.
Updated Other library upgrades: Norconex Committer to 2.0.0 and Norconex Commons Lang to 1.5.0.
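
The FileMetadataChecksummer entry in the list above describes deriving a file's representation from its modified date and size. A minimal Java sketch of that idea follows. The class name, method names, and the "modified_size" string format are assumptions for illustration and do not reflect the actual checksum format used by the collector.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; not the Norconex FileMetadataChecksummer.
class MetadataChecksumSketch {

    // Combines the two metadata values into one comparable string.
    static String checksum(long lastModifiedMillis, long sizeBytes) {
        return lastModifiedMillis + "_" + sizeBytes;
    }

    // Reads the same two values from a file on disk.
    static String checksum(Path file) throws IOException {
        return checksum(
                Files.getLastModifiedTime(file).toMillis(),
                Files.size(file));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("checksum-sketch", ".txt");
        try {
            Files.write(tmp, new byte[] {1, 2, 3});
            System.out.println(checksum(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Comparing this string between crawls lets a crawler skip files whose date and size are unchanged without reading their content, at the cost of missing edits that preserve both values.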

1.0.0 Initial release Release date 2014-08-25

New Initial release.