Norconex Crawler Core

1.x Release Notes

Release History

Version Date Description
1.10.1 2021-10-18 Maintenance release
1.10.0 2019-12-22 Feature release
1.9.1 2018-07-29 Maintenance release
1.9.0 2017-11-26 Feature release
1.8.2 2017-05-26 Bugfix release
1.8.1 2017-05-25 Maintenance release
1.8.0 2017-04-26 Feature release
1.7.0 2016-12-14 Feature release
1.6.0 2016-08-25 Feature release
1.5.0 2016-06-03 Feature release
1.4.0 2016-02-28 Maintenance release
1.3.0 2015-11-06 Feature release
1.2.1 2015-08-07 Maintenance release
1.2.0 2015-07-22 Feature release
1.1.0 2015-04-08 Feature release
1.0.2 2015-02-04 Bug fix release
1.0.1 2014-12-03 Bug fix release
1.0.0 2014-11-26 Initial release

1.10.1 Maintenance release Release date 2021-10-18 Download

Updated Now logging crawler id with events.
Updated Norconex JEF 4.1.3, Norconex Commons Lang 1.15.2
Updated New ICrawlDataStore#getProcessed(String) method.
Fixed Fixed sometimes not "remembering" bad status and error when using "grace once". #31

1.10.0 Feature release Release date 2019-12-22 Download

New Added "unmanaged" attribute to "logsDir" configuration option to prevent the collector from managing its own file-based logging.
New New "maxParallelCrawlers" collector configuration option. Limits the number of crawlers running at any given time, queuing the others. #25
New Added SSL support to MongoDB crawl data store.
New New AbstractCollector#getState() method.
New Added advanced configuration parameters to MVStoreCrawlDataStoreFactory.
Updated Maven dependency updates: Norconex Commons Lang 1.15.1, Norconex Importer 2.10.0, Norconex Committer Core 2.1.3, Norconex JEF 4.1.2, H2 1.4.199.
Fixed SpoiledReferenceStrategy of GRACE_ONCE now properly deletes a document on a subsequent failure.
Fixed Add a retry to MongoDB upserts to fix getting a constraint violation on concurrent upserts. #24
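
The retry added for concurrent upserts can be pictured with a small generic helper. This is a minimal sketch, not the crawl data store code: RetrySketch, withRetry, and the use of IllegalStateException as a stand-in for a constraint violation are hypothetical names chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper; not part of Norconex Crawler Core.
final class RetrySketch {

    private RetrySketch() {}

    // Runs the action, retrying up to maxAttempts times when it throws
    // IllegalStateException (our stand-in for a constraint violation).
    static <T> T withRetry(Supplier<T> action, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IllegalStateException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (IllegalStateException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // The first call fails, as if another thread inserted the same key first.
        String result = withRetry(() -> {
            if (calls.incrementAndGet() == 1) {
                throw new IllegalStateException("simulated constraint violation");
            }
            return "upserted";
        }, 2);
        System.out.println(result + " after " + calls.get() + " attempts");
    }
}
```

A single retry is usually enough in this situation because the competing write has completed by the time of the second attempt, so the operation then behaves as an update rather than an insert.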

1.9.1 Maintenance release Release date 2018-07-29 Download

Updated Significant performance improvement on MongoCrawlDataStore#isQueueEmpty().
Updated Dependency updates: Norconex Importer 2.9.0, Norconex Commons Lang 1.15.0.
Updated AbstractCrawler now logs documents it could not process as INFO.
Updated MongoCrawlDataStore#buildMongoClient and #buildMongoCredentials methods were moved to MongoConnectionDetails. #15
Fixed Fixed checksum creation for embedded documents pulling the wrong cached checksum, causing them to always appear new when metadataChecksummer is disabled.
Fixed Fixed showing wrong path in error message when command-line variable file is invalid. #16
Fixed Fixed NullPointerException under some conditions for AbstractCrawlerConfig#saveToXML(...).

1.9.0 Feature release Release date 2017-11-26 Download

New New "sourceFieldsRegex" option on GenericMetadataChecksummer and MD5DocumentChecksummer allowing the use of regular expressions to match the fields to use for building the checksum.
New New "combineFieldsAndContent" option on MD5DocumentChecksummer to use both fields and content for building the checksum.
New Can now specify custom collection names when using MongoCrawlDataStore and AbstractMongoCrawlDataStoreFactory implementations.
New New "stopOnExceptions" option added to crawler configuration to force the crawler to stop upon encountering any of the specified exceptions.
Updated The MongoCrawlDataStore now accepts references longer than 1024 characters.
Updated AbstractCrawler no longer creates the work directory on object construction, but rather does it when the crawler starts.
Updated Dependency updates: Norconex Importer 2.8.0, Norconex Commons Lang 1.14.0, Norconex Committer Core 2.1.2, Apache Commons DbUtils 1.7, MongoDB Java Driver 3.5.0, H2 Database 1.4.196.
Fixed When orphan strategy is "PROCESS", the crawler now always attempts to process a document, regardless of sitemap delays or recrawlable delays, since the reason for it to become an orphan may be deletion, and we do not want to wait for a future crawl cycle to find out.
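
The "sourceFieldsRegex" and "combineFieldsAndContent" options above can be illustrated with a self-contained sketch. It is not the MD5DocumentChecksummer implementation: the class ChecksumSketch, its method names, and the way fields are serialized before hashing are assumptions made for this example. Only the option names come from the notes above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical illustration; not the Norconex checksummer classes.
final class ChecksumSketch {

    private ChecksumSketch() {}

    // Builds an MD5 checksum from the fields whose names match the regex,
    // from the content, or from both (assumed semantics).
    static String checksum(Map<String, String> fields, String sourceFieldsRegex,
            String content, boolean combineFieldsAndContent) {
        StringBuilder input = new StringBuilder();
        boolean useFields = sourceFieldsRegex != null;
        if (useFields) {
            Pattern p = Pattern.compile(sourceFieldsRegex);
            // TreeMap gives a stable field order so the checksum is reproducible.
            for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
                if (p.matcher(e.getKey()).matches()) {
                    input.append(e.getKey()).append('=')
                         .append(e.getValue()).append(';');
                }
            }
        }
        if (!useFields || combineFieldsAndContent) {
            input.append(content);
        }
        return md5Hex(input.toString());
    }

    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new HashMap<>();
        fields.put("dc.title", "Hello");
        fields.put("dc.date", "2017-11-26");
        fields.put("Server", "nginx");
        // Only fields starting with "dc." contribute; "Server" is ignored.
        System.out.println(checksum(fields, "dc\\..*", "body text", false));
        // Fields and content both contribute.
        System.out.println(checksum(fields, "dc\\..*", "body text", true));
    }
}
```

Under these assumptions, a change to a field that the regular expression does not match leaves the checksum untouched, which is the point of restricting the source fields.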

1.8.2 Bugfix release Release date 2017-05-26 Download

Updated Dependency updates: Norconex Importer 2.7.2.
Fixed Fixed "caseSensitive" flag sometimes having no effect in RegexMetadataFilter and RegexReferenceFilter.

1.8.1 Maintenance release Release date 2017-05-25 Download

New MongoCrawlDataStore now supports specifying the MongoDB authentication mechanism to use (MONGODB-CR or SCRAM-SHA-1).
Updated Classes related to MongoDB crawl store implementation were updated to use MongoDB 3.x API.
Updated Dependency updates: Norconex Importer 2.7.1, Norconex Committer Core 2.1.1, MongoDB Driver 3.4.2, Fongo 2.0.13 (for tests).
Updated AbstractCollector#saveToXML(...) now written with xml:space="preserve".
Fixed Fixed "importer" config section not being inherited from "crawlerDefaults" when a specific crawler configuration does not declare one.

1.8.0 Feature release Release date 2017-04-26 Download

New Added schema-based XML configuration validation which can be triggered from the command prompt with this new flag: -k or --checkcfg
New New ICollectorLifeCycleListener interface that can be added on the collector configuration to be notified and take action when the collector starts and stops.
New Two new crawler events were added for crawler event listeners: CRAWLER_STOPPING and CRAWLER_STOPPED.
New AbstractMongoCrawlDataStoreFactory now accepts encrypted passwords.
New Now distributed with utility scripts.
Updated Crawler events REJECTED_FILTER, REJECTED_BAD_STATUS, REJECTED_IMPORT, and REJECTED_ERROR are now DEBUG in log4j.properties.
Updated When their log level is DEBUG, the word "Subject:" has been removed from crawler event messages and "No additional information available." is shown when there is no extra info to show.
Updated Dependency updates: Norconex Commons Lang 1.13.0, Norconex Importer 2.7.0, Norconex JEF API 4.1.0, Norconex Committer Core 2.1.0, JSoup 1.10.2.
Updated Modified Javadoc to include an XML usage example for all XML-configurable classes.
Updated ICrawlerConfig no longer implements Cloneable.
Updated Document, metadata, and reference filters now log an appropriate message when there is no "include" match, when log level is DEBUG.
Fixed Fixed crawler defaults not always being applied as they should.
Fixed Fixed minor errors in writing IXMLConfigurable classes to XML.
Fixed Throwable exceptions no longer make a crawler hang under certain conditions when importing/parsing a file.
Removed Removed code deprecated in version 1.2 or older.
Removed Removed MapDB and Apache Derby crawlstore dependencies/implementations which were deprecated in version 1.6.

1.7.0 Feature release Release date 2016-12-14 Download

New It is now possible to add JEF-related listeners on the collector configuration.
Updated JMX support is now disabled by default to improve performance. It can be enabled by adding the JVM argument: -DenableJMX=true
Updated Dependency updates: Norconex Commons Lang 1.12.3, Norconex Importer 2.6.1, Norconex JEF API 4.0.8, Joda Time 2.9.4, JJ2000 5.3, Apache HTTP Client 4.5.2, Apache HTTP Core 4.4.5, Apache Commons Logging 1.2
Fixed Fixed NullPointerException when stopping a crawler that did not previously run.
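
A -D JVM argument such as -DenableJMX=true surfaces in Java code as a system property. The sketch below shows one common way to read such a flag. The property name comes from the note above, but the class JmxFlagSketch and the way the crawler actually consumes the flag are assumptions.

```java
// Hypothetical illustration of reading a boolean JVM flag.
final class JmxFlagSketch {

    private JmxFlagSketch() {}

    // Boolean.getBoolean returns true only when the named system property
    // exists and equals "true" (ignoring case); otherwise it returns false.
    static boolean isJmxEnabled() {
        return Boolean.getBoolean("enableJMX");
    }

    public static void main(String[] args) {
        System.out.println("JMX enabled: " + isJmxEnabled());
    }
}
```

Running this class with the -DenableJMX=true JVM argument reports the flag as enabled. Without the argument the flag is reported as disabled, matching an off-by-default setting.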

1.6.0 Feature release Release date 2016-08-25 Download

New New "checkcfg" launch action that will load a configuration without doing anything with it (to help resolve config issues).
New New CrawlState#isSkipped() method to indicate if a document was unmodified or premature.
New New AbstractCrawler#beforeFinalizeDocumentProcessing() method to let crawler implementations act on a document before it is finalized.
Updated MVStoreCrawlDataStoreFactory is now the default crawl store factory (replacing the now-deprecated MapDB implementation).
Updated Dependency updates: Norconex Importer 2.6.0, Norconex Committer Core 2.0.5, JSoup 1.9.2, Apache Commons DBCP 2.1.1, H2 Database 1.4.192.
Updated API break: method signature changed for AbstractCrawler from applyCrawlData(ICrawlData crawlData, ImporterDocument document) to initCrawlData(ICrawlData crawlData, ICrawlData cachedCrawlData, ImporterDocument document).

1.5.0 Feature release Release date 2016-06-03 Download

New New BasicJDBCCrawlDataStoreFactory implementation for collector implementations with basic crawl storage needs.
New New document crawl state: PREMATURE.
New New crawler event: REJECTED_PREMATURE.
Updated Default database implementation for AbstractJDBCDataStoreFactory when invoked with an empty constructor is now H2.
Updated When provided by collectors, document "crawl date" and content type can be added to the crawl data and will be stored in the crawl data store (affects all ICrawlDataStoreFactory implementations).
Updated Dependency updates: Norconex Importer 2.5.2, MapDB 1.0.9, H2 1.4.191, Fongo 1.6.2.
Updated Event string value for DOCUMENT_COMMITTED_REMOVE changed from DOCUMENT_COMMITTED_REMOV to DOCUMENT_COMMITTED_REMOVE.

1.4.0 Maintenance release Release date 2016-02-28 Download

Updated Dependency updates: Norconex Importer 2.5.0.
Updated ExtensionReferenceFilter is now smarter at detecting extensions. #2
Updated ExtensionReferenceFilter now allows white spaces around extensions in XML config.

1.3.0 Feature release Release date 2015-11-06 Download

Updated Specifying an invalid path on the command-line for the config file or variable file now returns a meaningful message.
Updated Maven direct dependency updates: Norconex Importer 2.4.0, Norconex JEF 4.0.7, Mongo Java Driver 2.13.3, Apache Derby 10.12.1.1.
Updated Now logs (level INFO) a less alarming message when a module version cannot be found.
Updated Now logs module version information in file.
Updated A new metadata boolean field called "collector.is-crawl-new" is now added before document importing. It indicates whether the document is already known to the crawler from a previous run.
Updated Cached instance of a reference data is now passed around as opposed to being obtained from the reference cache each time it is needed.
Updated Saved and loaded configuration-related classes are now equal. Methods equals/hashCode/toString for those classes are now implemented uniformly and were added where missing.
Fixed Fixed some configuration classes not always being saved to XML properly or giving errors.
Fixed Fixed IOException when "keepDownloads" is true. This was occurring for URLs with no path (just the host name). Now prefixes created directories and files with "d." and "f." respectively.

1.2.1 Maintenance release Release date 2015-08-07 Download

Updated AbstractCrawler is no longer deleting remaining orphans after they have been processed (when orphan strategy is PROCESS).
Updated Verbose logging in AbstractCrawler#processNextReference(...) has been changed from loglevel DEBUG to TRACE.
Updated Dependency updates: Norconex Importer 2.3.1 and Norconex Committer Core 2.0.2.

1.2.0 Feature release Release date 2015-07-22 Download

New New configurable option: ISpoiledStateStategyResolver. It allows one to customize what strategy to adopt when a reference is in a bad crawl state (ignore, delete, or grace once). A default implementation is provided: GenericSpoiledStateStrategyResolver.
New New GenericMetadataChecksummer for choosing one or many metadata fields and their values to create a checksum.
New Now printing release versions of Norconex libraries used when a collector is launched.
New New NOT_FOUND state constant added to CrawlState (migrated from the HTTP Collector).
Updated AbstractCrawler is now firing REJECTED_ERROR events when an exception prevented proper processing of a reference.
Updated Documents with a bad crawl state other than "NOT_FOUND" are now given one chance to recover before a deletion request gets sent. This can be overridden.
Updated The OrphansStrategy default in crawler config is now PROCESS to get around cases where temporary conditions prevent accessing some documents that should normally be accessible (and that should not escape re-processing on incremental crawls).
Updated MD5DocumentChecksummer#setField(String) has been deprecated in favor of MD5DocumentChecksummer#setFields(String...).
Updated CrawlState#isCommittable() has been deprecated in favor of CrawlState#isNewOrModified().
Updated Setter methods signatures accepting an array in AbstractCrawlerConfig were updated to accept "varargs" instead (variable arguments).
Updated Uses default port when no Mongo port is specified when using Mongo data store.
Updated When the saving of documents is enabled, each saved document is no longer printed to STDOUT but logged as a Log4j debug statement instead.
Updated Regular expressions in RegexMetadataFilter and RegexReferenceFilter now always have the Pattern.DOTALL flag enabled and when case sensitivity is enabled for regex, Pattern.UNICODE_CASE is now always used.
Updated Library updates: Norconex JEF 4.0.6, Norconex Importer 2.3.0, Norconex Commons Lang 1.6.2, Mongo Java Driver 2.13.2, H2 database 1.4.187. New dependency: JUnit 4.12 (test scope).
Updated Jar manifest now includes implementation entries and specifications entries (matching Maven pom.xml).
Updated Javadoc fixes and updates.
Fixed Updated Mongo indexes to use stage instead of state. (Github collector-http#97).
Fixed Stopping a job that has been resumed now works as expected.
Removed ICrawlDataStore#isVanished(ICrawlData) has been deprecated.
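
The effect of the regular expression flags mentioned above for RegexMetadataFilter and RegexReferenceFilter can be checked with plain java.util.regex. The sketch below is not the filter code. RegexFlagsSketch and its compile method are hypothetical, and because Pattern.UNICODE_CASE only affects matching together with Pattern.CASE_INSENSITIVE, the sketch assumes it is applied when matching is case-insensitive.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not the Norconex filter classes.
final class RegexFlagsSketch {

    private RegexFlagsSketch() {}

    // DOTALL is always set so "." also matches line terminators.
    // For case-insensitive matching, UNICODE_CASE extends case folding
    // beyond US-ASCII (assumed interpretation of the release note).
    static Pattern compile(String regex, boolean caseSensitive) {
        int flags = Pattern.DOTALL;
        if (!caseSensitive) {
            flags |= Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        }
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        // "." matches the newline because of DOTALL.
        System.out.println(compile("a.b", true).matcher("a\nb").matches());
        // Lowercase e-acute matches uppercase E-acute thanks to UNICODE_CASE.
        System.out.println(compile("\u00e9", false).matcher("\u00c9").matches());
        // Without UNICODE_CASE, case-insensitivity covers US-ASCII only.
        System.out.println(Pattern.compile("\u00e9", Pattern.CASE_INSENSITIVE)
                .matcher("\u00c9").matches());
    }
}
```

The first two checks match and the third does not, which shows why values spanning several lines or containing accented characters could behave differently before these flags were applied consistently.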

1.1.0 Feature release Release date 2015-04-08 Download

New New methods and configuration attribute to disable checksum creation in MD5DocumentChecksummer.
Updated Library updates: Norconex Committer Core 2.0.1, Norconex Importer 2.1.1, Norconex JEF 4.0.4, MapDB 1.0.7, Apache Commons BeanUtils 1.9.2, Apache Commons DBCP2 2.1, Mongo Java Driver 2.13.0, H2 1.4.186.
Updated Added Sonatype repository to pom.xml for snapshot releases.
Updated Updated several Maven plugins and added SonarQube Maven plugin.
Updated Removed pom.xml dependency on Norconex Commons Lang, which is already provided by other dependencies.
Updated Subject in event logging is now only shown on DEBUG log level.
Updated The database XML configuration in AbstractJDBCDataStoreFactory is now case-insensitive.
Updated H2 database now has a write delay of zero to ensure durability on JVM crash.
Updated MapDB and MVStore implementations of ICrawlDataStore now force a commit on every addition, at the expense of performance, to ensure durability on JVM/OS/System crash.
Fixed BaseCrawlData#setDocumentChecksum(String) is now deprecated in favor of BaseCrawlData#setContentChecksum(String) to fix content checksum not being saved in crawl data store properly.
Fixed Fixed NullPointerException when running an incremental crawl over one that previously failed due to invalid configuration.
Fixed Fixed incremental run not always handling non-modified documents properly (sometimes deleting, sometimes re-adding).
Fixed Fixed NPE in AbstractJDBCDataStoreFactory#createCrawlDataStore(...) when database is null.

1.0.2 Bug fix release Release date 2015-02-04 Download

Fixed When splitting documents, crawlers will now trigger individual processing/deletion of children/embedded documents that no longer exist on incremental runs (based on your "orphansStrategy" configuration). When deleting orphans, deletion of a parent document will also trigger deletion requests to its children/embedded documents.
Fixed Fixed an infinite loop that sometimes occurred when dealing with multiple threads and the configured maxDocument is reached (and greater than zero). This could prevent a collector from ever stopping.
Fixed Fixed invalid detection of crawler execution state, affecting ability to stop a collector.
Fixed Crawl data is no longer added to document metadata after the import phase (which could conflict with some handlers, like KeepOnlyTagger).
Updated Default logging of Crawler events is now better aligned.
Updated Updated JEF API to version 4.0.2.
Updated Javadoc corrections.

1.0.1 Bug fix release Release date 2014-12-03 Download

Fixed When keepDownloads is true, saved files and directories are now prefixed with "f." and "d." respectively to avoid collisions.
Updated Crawler id is now set on JEF JobSuite when a new thread starts to improve logging.
Updated Upgraded norconex-jef to 4.0.1.
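
The collision that the "f." and "d." prefixes avoid arises when one reference names a file (for example /a) and another treats the same name as a directory (for example /a/b). The sketch below shows the idea only. DownloadPathSketch and toRelativePath are hypothetical, and the exact on-disk layout used by the crawler is not described in these notes.

```java
// Hypothetical illustration of prefixing saved files and directories.
final class DownloadPathSketch {

    private DownloadPathSketch() {}

    // Every directory segment gets a "d." prefix and the final segment,
    // treated as the file, gets an "f." prefix (assumed layout).
    static String toRelativePath(String host, String... pathSegments) {
        StringBuilder sb = new StringBuilder("d.").append(host);
        for (int i = 0; i < pathSegments.length; i++) {
            boolean last = i == pathSegments.length - 1;
            sb.append('/').append(last ? "f." : "d.").append(pathSegments[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "a" as a file and "a" as a directory no longer share a name.
        System.out.println(toRelativePath("example.com", "a"));
        System.out.println(toRelativePath("example.com", "a", "b"));
    }
}
```

With this layout the two references are saved as d.example.com/f.a and d.example.com/d.a/f.b, so the file "a" and the directory "a" can coexist on disk.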

1.0.0 Initial release Release date 2014-11-26 Download

New Initial release.