Norconex Crawler Core

2.x Release Notes

Release History

Version Date Description
2.1.0-SNAPSHOT 2024-??-?? Minor release.
2.0.2 2023-07-09 Maintenance release.
2.0.1 2022-08-30 Maintenance release.
2.0.0 2022-01-02 Major release. NOT a drop-in replacement for 1.x.

2.1.0-SNAPSHOT Minor release. Release date 2024-??-??

This release is currently in development and the following information may change.
Updated Minimum Java Version is now 11.
Fixed Fixed crawler throwing error when issuing a stop command.

2.0.2 Maintenance release. Release date 2023-07-09

New New "deferredShutdownDuration" collector configuration option to delay the collector shutdown when it's done executing.
Updated Maven dependency updates: norconex-commons-maven-parent 1.0.2, H2 2.2.220, JSoup 1.15.3.
Updated JMX crawler MBeans are now unregistered as the last thing before collector shutdown.
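The two shutdown-related entries above describe an ordering: finish the work, wait for the configured delay, then tear down, with JMX unregistration as the final step. The following is a minimal Java sketch of that ordering only. It is not the collector's implementation, and `DeferredShutdownSketch`, `runAndShutdown`, and the step labels are hypothetical names introduced for illustration.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Norconex Crawler Core.
final class DeferredShutdownSketch {

    private DeferredShutdownSketch() {
    }

    // Runs the work, waits for the deferral period (if positive), then runs
    // the teardown steps in the order given. Returns the phases performed.
    static List<String> runAndShutdown(
            Runnable work, Duration deferral, List<Runnable> teardownSteps) {
        List<String> phases = new ArrayList<>();
        work.run();
        phases.add("work-done");
        if (!deferral.isZero() && !deferral.isNegative()) {
            try {
                Thread.sleep(deferral.toMillis());
                phases.add("deferred");
            } catch (InterruptedException e) {
                // Restore the interrupt flag and proceed straight to teardown.
                Thread.currentThread().interrupt();
            }
        }
        for (Runnable step : teardownSteps) {
            step.run();
        }
        phases.add("shutdown");
        return phases;
    }
}
```

In this sketch, a caller wanting JMX unregistration to happen last would simply pass it as the final element of `teardownSteps`.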

2.0.1 Maintenance release. Release date 2022-08-30

New New MDC attributes which can be used in supporting logging frameworks: "ctx:crawler.id", "ctx:crawler.id.safe", "ctx:collector.id", and "ctx:collector.id.safe".
Fixed Fixed occasional concurrency issue when crawler terminates.
Fixed Fixed the crawler sometimes not exiting when done.

2.0.0 Major release. NOT a drop-in replacement for 1.x. Release date 2022-01-02

Updated Updated transitive dependencies with known vulnerabilities.
Updated The name of the data store engine "storetypes" collection/table has been shortened to just the class "simple" name + "--storetypes".
Updated Updated dependencies to avoid logging library detection conflict.
Updated Updated JdbcDataStoreEngine table name automatic creation to take into account more special characters.
Fixed Fixed "maxConcurrentCrawlers" throwing IllegalStateException with a "Connection Pool Shut Down" message when set to a value other than the default or one that does not match the number of crawlers.
Fixed Fixed JdbcDataStoreEngine#getStoreNames not returning proper names.
Fixed Fixed JdbcDataStoreEngine XML configuration being loaded twice.
Fixed Fixed MongoDataStore#deleteFirst not successfully deleting and returning the first record.
Fixed Fixed data store deserialization not taking into account sub-types, affecting JDBC and MongoDB implementations.
Fixed Fixed data store engine resources not being included as part of the crawler resource cleaning process.
Fixed Fixed throwing an error when trying to log the execution summary after the data store engine was closed.
New New StopCrawlerOnMaxEventListener class to stop crawlers upon reaching a maximum number of specific crawler events.
New New DeleteRejectedEventListener class to delete documents matching specific document "rejected" events.
New Added deduplication configuration options via CrawlerConfig#setMetadataDeduplicate and CrawlerConfig#setDocumentDeduplicate.
New New crawler event: REJECTED_DUPLICATE.
Updated Maven dependency updates: MongoDB Driver 4.3.2, Testcontainers 1.16.0.
Updated Launching crawler now sets crawler name as thread name even before starting to process references.
Updated Metadata checksummer now an element of CrawlerConfig.
Updated Checksummers "targetField", "sourceFields", and "sourceFieldsRegex" are deprecated in favor of "toField" and "fieldMatcher".
Updated RegexMetadataFilter and RegexReferenceFilter have been deprecated in favor of MetadataFilter and ReferenceFilter.
Updated Checksummers "disabled" flag deprecated in favor of setting a null checksummer or using a self-closed checksummer tag in config.
Fixed Fixed invalid configuration in POM "maven-dependency-plugin".
New Added JdbcDataStoreEngine as a data store implementation.
New Added "crawlersStartInterval" configuration option.
New New crawler events: DOCUMENT_QUEUED, DOCUMENT_PROCESSED.
New JMX reporting now returns active references and event counts.
New Now provides an execution summary at the end of a crawler execution.
Removed Removed JEF dependency in favor of improved JMX for tracking.
New Now supports providing multiple committers.
New New collector events: COLLECTOR_RUN_BEGIN, COLLECTOR_RUN_END, COLLECTOR_STOP_BEGIN, COLLECTOR_STOP_END, COLLECTOR_CLEAN_BEGIN, COLLECTOR_CLEAN_END, COLLECTOR_STORE_EXPORT_BEGIN, COLLECTOR_STORE_EXPORT_END, COLLECTOR_STORE_IMPORT_BEGIN, COLLECTOR_STORE_IMPORT_END
New New crawler events: CRAWLER_INIT_BEGIN, CRAWLER_INIT_END, CRAWLER_RUN_BEGIN, CRAWLER_RUN_END, CRAWLER_STOP_BEGIN, CRAWLER_STOP_END, CRAWLER_CLEAN_BEGIN, CRAWLER_CLEAN_END.
New New method on CrawlerEvent: isCrawlerShutdown.
New New UNSUPPORTED crawl state.
New New Collector#clean() method and related events.
New New Collector#exportDataStore(), Collector#importDataStore() methods and related events.
New New .core.reference package along with new .core.store package for storing of URL crawling information.
New New IDataStoreEngine accessible from crawler to store any kind of objects by implementors in their own extensions.
New AbstractDocumentChecksummer and AbstractMetadataChecksummer classes (and their subclasses) now have an "onSet" configurable option for dictating how values are set: append, prepend, replace, optional.
New New CrawlDoc, CrawlDocInfo, and CrawlDocMetadata (either new or renamed).
New New Crawler#isQueueInitialized() method to support asynchronous reference queueing.
New Now logging throughput (documents per second) and estimated remaining time.
Updated Now always resume previous incomplete executions. Can now "clean" to start fresh.
Updated Now using XML class from Norconex Commons Lang for loading/saving configuration.
Updated Now using SLF4J for logging.
Updated Lists are now replacing arrays in most places.
Updated ICollector, ICollectorConfig, ICrawler, ICrawlerConfig were all replaced with Collector, CollectorConfig, Crawler, and CrawlerConfig.
Updated Default working directory structure has been modified.
Updated Path is now used in addition to, or instead of, File in many places.
Updated The configurable CollectorLifeCycleListener, IJobLifeCycleListener, IJobErrorListener, ISuiteLifeCycleListener, and ICrawlerEventListener have all been replaced with IEventListener. The new listeners can be set on the collector configuration, or implemented on configuration objects, in which case they are detected automatically.
Updated Dependency updates: Norconex Importer 3.0.0, Norconex JEF 5.0.0, Norconex Commons Lang 2.0.0, Norconex Committer 3.0.0, H2 1.4.197.
Updated CrawlerConfig#OrphanStrategy is now public.
Updated Now requires Java 8 or higher.
Updated Command-line arguments have changed and offer more options, such as "cleaning" previous executions, importing/exporting the crawl store, forcing a commit of anything remaining in the committer queue, and rendering the configuration file once interpreted.
Updated Now uses simple file locks to prevent running conflicting commands concurrently.
Updated Dates now take the time zone into consideration.
Updated Collector "maxParallelCrawlers" is now deprecated in favor of "maxConcurrentCrawlers".
Removed Removed "data" package in favor of "reference" package.
Removed Removed some of the deprecated code from 1.x.
Removed Removed CRAWLER_RESUMED crawler event.
Removed Removed CollectorConfigLoader, CollectorLifeCycleListener, CrawlerLifeCycleListener, IJobLifeCycleListener, IJobErrorListener, ISuiteLifeCycleListener, ICrawlerEventListener (replaced by IEventListener).
Removed Removed all previously available crawl store implementations in favor of the new MVStoreDataStore.
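The idea behind stopping a crawler after a maximum number of specific events, as described for StopCrawlerOnMaxEventListener above, can be sketched in plain Java. This is a hedged illustration of the counting logic only, not the class's actual source or API: `StopOnMaxEventsSketch`, `onEvent`, and the `stopAction` callback are hypothetical names, and events are represented as plain strings using event names taken from these notes.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; not the StopCrawlerOnMaxEventListener source.
final class StopOnMaxEventsSketch {

    private final Set<String> watchedEvents;
    private final long maximum;
    private final Runnable stopAction;
    private final AtomicLong count = new AtomicLong();
    private final AtomicBoolean stopRequested = new AtomicBoolean();

    StopOnMaxEventsSketch(
            Set<String> watchedEvents, long maximum, Runnable stopAction) {
        this.watchedEvents = watchedEvents;
        this.maximum = maximum;
        this.stopAction = stopAction;
    }

    // Called for every event. Only watched events are counted, and the stop
    // action runs at most once, when the count first reaches the maximum.
    void onEvent(String eventName) {
        if (!watchedEvents.contains(eventName)) {
            return;
        }
        if (count.incrementAndGet() >= maximum
                && stopRequested.compareAndSet(false, true)) {
            stopAction.run();
        }
    }

    long count() {
        return count.get();
    }
}
```

The atomic counter and flag are a design choice for this sketch: crawler events can be fired from several threads, so the count must be thread-safe and the stop request should not be issued twice.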