Updated |
Updated transitive dependencies with known vulnerabilities.
|
|
Updated |
The name of the data store engine "storetypes" collection/table has
been shortened to just the class "simple" name + "--storetypes".
|
|
Updated |
Updated dependencies to avoid logging library detection conflict.
|
|
Updated |
Updated JdbcDataStoreEngine table name automatic creation to take into
account more special characters.
|
|
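As an illustration of the kind of name normalization the entry above refers to, here is a minimal sketch. It is not the actual JdbcDataStoreEngine logic: the `TableNames` class, the `toTableName` method, and the rule of replacing everything outside `[A-Za-z0-9_]` with an underscore are all assumptions made for the example.

```java
public final class TableNames {
    private TableNames() {}

    /**
     * Hypothetical helper: builds a table name from a prefix and a store
     * name, replacing every character outside [A-Za-z0-9_] with "_".
     */
    public static String toTableName(String prefix, String storeName) {
        String raw = prefix + storeName;
        String safe = raw.replaceAll("[^A-Za-z0-9_]", "_");
        // Assumption: identifiers should not start with a digit,
        // so such names get a leading underscore.
        if (!safe.isEmpty() && Character.isDigit(safe.charAt(0))) {
            safe = "_" + safe;
        }
        return safe;
    }

    public static void main(String[] args) {
        System.out.println(toTableName("my-crawler.", "queue#1"));
    }
}
```

Real databases differ in which identifier characters and lengths they accept, so a production version would likely need per-database rules.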
Fixed |
Fixed "maxConcurrentCrawlers" throwing IllegalStateException with
"Connection Pool Shut Down" message when its value differs from the
default or does not match the number of crawlers.
|
|
Fixed |
Fixed JdbcDataStoreEngine#getStoreNames not returning proper names.
|
|
Fixed |
Fixed JdbcDataStoreEngine XML configuration being loaded twice.
|
|
Fixed |
Fixed MongoDataStore#deleteFirst not successfully deleting and
returning the first record.
|
|
Fixed |
Fixed data store deserialization not taking into account sub-types,
affecting JDBC and MongoDB implementations.
|
|
Fixed |
Fixed data store engine resources not being included as part
of the crawler resource cleaning process.
|
|
Fixed |
Fixed throwing an error when trying to log the execution summary
after the data store engine was closed.
|
|
New |
New StopCrawlerOnMaxEventListener class to stop crawlers upon reaching
a maximum number of specific crawler events.
|
|
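The idea of stopping after a maximum number of events can be sketched with a small thread-safe counter. This is a standalone illustration, not the StopCrawlerOnMaxEventListener implementation: the `MaxEventCounter` class and its `accept` method are hypothetical names, and events are represented as plain strings for simplicity.

```java
import java.util.concurrent.atomic.AtomicLong;

public final class MaxEventCounter {
    private final String eventName;
    private final long maximum;
    private final AtomicLong count = new AtomicLong();

    public MaxEventCounter(String eventName, long maximum) {
        this.eventName = eventName;
        this.maximum = maximum;
    }

    /**
     * Records one event. Only events whose name matches are counted.
     * Returns true once the maximum has been reached, which a caller
     * could treat as the signal to stop crawling.
     */
    public boolean accept(String name) {
        if (eventName.equals(name)) {
            return count.incrementAndGet() >= maximum;
        }
        return count.get() >= maximum;
    }
}
```

An AtomicLong is used because crawler events are typically fired from several threads at once.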
New |
New DeleteRejectedEventListener class to delete documents matching
specific document "rejected" events.
|
|
New |
Added deduplication configuration options via
CrawlerConfig#setMetadataDeduplicate and
CrawlerConfig#setDocumentDeduplicate.
|
|
New |
New crawler event: REJECTED_DUPLICATE.
|
|
Updated |
Maven dependency updates: MongoDB Driver 4.3.2, Testcontainers 1.16.0.
|
|
Updated |
Launching a crawler now sets the crawler name as the thread name, even
before it starts processing references.
|
|
Updated |
Metadata checksummer is now an element of CrawlerConfig.
|
|
Updated |
Checksummers "targetField", "sourceFields", and "sourceFieldsRegex"
are deprecated in favor of "toField" and "fieldMatcher".
|
|
Updated |
RegexMetadataFilter and RegexReferenceFilter have been deprecated
in favor of MetadataFilter and ReferenceFilter.
|
|
Updated |
Checksummers "disabled" flag deprecated in favor of setting a null
checksummer or using a self-closed checksummer tag in config.
|
|
Fixed |
Fixed invalid configuration in POM "maven-dependency-plugin".
|
|
New |
Added JdbcDataStoreEngine as a data store implementation.
|
|
New |
Added "crawlersStartInterval" configuration option.
|
|
New |
New crawler events:
DOCUMENT_QUEUED, DOCUMENT_PROCESSED.
|
|
New |
JMX reporting now returns active references and event counts.
|
|
New |
Now provides an execution summary at the end of a crawler execution.
|
|
Removed |
Removed JEF dependency in favor of improved JMX for tracking.
|
|
New |
Now supports providing multiple committers.
|
|
New |
New collector events: COLLECTOR_RUN_BEGIN, COLLECTOR_RUN_END,
COLLECTOR_STOP_BEGIN, COLLECTOR_STOP_END,
COLLECTOR_CLEAN_BEGIN, COLLECTOR_CLEAN_END,
COLLECTOR_STORE_EXPORT_BEGIN, COLLECTOR_STORE_EXPORT_END,
COLLECTOR_STORE_IMPORT_BEGIN, COLLECTOR_STORE_IMPORT_END.
|
|
New |
New crawler events: CRAWLER_INIT_BEGIN, CRAWLER_INIT_END,
CRAWLER_RUN_BEGIN, CRAWLER_RUN_END,
CRAWLER_STOP_BEGIN, CRAWLER_STOP_END,
CRAWLER_CLEAN_BEGIN, CRAWLER_CLEAN_END.
|
|
New |
New method on CrawlerEvent: isCrawlerShutdown.
|
|
New |
New UNSUPPORTED crawl state.
|
|
New |
New Collector#clean() method and related events.
|
|
New |
New Collector#exportDataStore(), Collector#importDataStore() methods
and related events.
|
|
New |
New .core.reference package along with new .core.store package
for storing URL crawling information.
|
|
New |
New IDataStoreEngine accessible from the crawler, letting implementors
store any kind of object in their own extensions.
|
|
New |
AbstractDocumentChecksummer and AbstractMetadataChecksummer classes
(and their subclasses) now have an "onSet" configurable option for
dictating how values are set: append, prepend, replace, optional.
|
|
New |
New CrawlDoc, CrawlDocInfo, and CrawlDocMetadata (either new
or renamed).
|
|
New |
New Crawler#isQueueInitialized() method to support asynchronous
reference queueing.
|
|
New |
Now logging throughput (documents per second) and estimated remaining
time.
|
|
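For readers curious how such figures are typically derived, the following is a minimal sketch of the arithmetic. It is not the crawler's own code: the `ThroughputEstimate` class and its method names are assumptions, and it simply divides processed documents by elapsed time, then divides the queued count by that rate.

```java
import java.time.Duration;

public final class ThroughputEstimate {
    private ThroughputEstimate() {}

    /** Documents processed per second over the elapsed time. */
    public static double docsPerSecond(long processed, Duration elapsed) {
        double seconds = elapsed.toMillis() / 1000.0;
        return seconds <= 0 ? 0.0 : processed / seconds;
    }

    /** Estimated time left for the queued documents at the given rate. */
    public static Duration remaining(long queued, double docsPerSecond) {
        if (docsPerSecond <= 0) {
            // No rate measured yet, so no estimate can be made.
            return Duration.ZERO;
        }
        return Duration.ofMillis(Math.round(queued / docsPerSecond * 1000));
    }

    public static void main(String[] args) {
        double rate = docsPerSecond(300, Duration.ofSeconds(60));
        Duration eta = remaining(1200, rate);
        System.out.println(rate + " docs/s, about "
                + eta.getSeconds() + " s remaining");
    }
}
```

Because a crawl queue usually keeps growing as new references are discovered, an estimate like this is only a rough lower bound on the time left.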
Updated |
Now always resumes previous incomplete executions. Use "clean"
to start fresh.
|
|
Updated |
Now using XML class from Norconex Commons Lang for loading/saving
configuration.
|
|
Updated |
Now using SLF4J for logging.
|
|
Updated |
Lists are now replacing arrays in most places.
|
|
Updated |
ICollector, ICollectorConfig, ICrawler, ICrawlerConfig were all
replaced with Collector, CollectorConfig, Crawler, and CrawlerConfig.
|
|
Updated |
Default working directory structure has been modified.
|
|
Updated |
Path is now used in addition to, or instead of, File in many places.
|
|
Updated |
Configurable CollectorLifeCycleListener, IJobLifeCycleListener,
IJobErrorListener, ISuiteLifeCycleListener, ICrawlerEventListener
all replaced with IEventListener. These new listeners can be set on
the collector configuration, or be implemented on configuration objects
and be detected automatically.
|
|
Updated |
Dependency updates: Norconex Importer 3.0.0, Norconex JEF 5.0.0,
Norconex Commons Lang 2.0.0, Norconex Committer 3.0.0, H2 1.4.197.
|
|
Updated |
CrawlerConfig#OrphanStrategy is now public.
|
|
Updated |
Now requires Java 8 or higher.
|
|
Updated |
Command-line arguments have changed, with more options such
as "cleaning" previous executions,
importing/exporting the crawl store, forcing a commit of any remains
from the committer queue, rendering the configuration file once
interpreted, etc.
|
|
Updated |
Now uses simple file locks to prevent running conflicting
commands concurrently.
|
|
Updated |
Dates now take the time zone into consideration.
|
|
Updated |
Collector "maxParallelCrawlers" is now deprecated in favor of
"maxConcurrentCrawlers".
|
|
Removed |
Removed "data" package in favor of "reference" package.
|
|
Removed |
Removed some of the deprecated code from 1.x.
|
|
Removed |
Removed CRAWLER_RESUMED crawler event.
|
|
Removed |
Removed CollectorConfigLoader, CollectorLifeCycleListener,
CrawlerLifeCycleListener, IJobLifeCycleListener, IJobErrorListener,
ISuiteLifeCycleListener, ICrawlerEventListener
(replaced by IEventListener).
|
|
Removed |
Removed all previously available crawl store implementations in favor
of the new MVStoreDataStore.
|
|