Version | Date | Description |
---|---|---|
1.10.1 | 2021-10-18 | Maintenance release |
1.10.0 | 2019-12-22 | Feature release |
1.9.1 | 2018-07-29 | Maintenance release |
1.9.0 | 2017-11-26 | Feature release |
1.8.2 | 2017-05-26 | Bug fix release |
1.8.1 | 2017-05-25 | Maintenance release |
1.8.0 | 2017-04-26 | Feature release |
1.7.0 | 2016-12-14 | Feature release |
1.6.0 | 2016-08-25 | Feature release |
1.5.0 | 2016-06-03 | Feature release |
1.4.0 | 2016-02-28 | Maintenance release |
1.3.0 | 2015-11-06 | Feature release |
1.2.1 | 2015-08-07 | Maintenance release |
1.2.0 | 2015-07-22 | Feature release |
1.1.0 | 2015-04-08 | Feature release |
1.0.2 | 2015-02-04 | Bug fix release |
1.0.1 | 2014-12-03 | Bug fix release |
1.0.0 | 2014-11-26 | Initial release |
Updated | Now logging crawler id with events. | |
Updated | Norconex JEF 4.1.3, Norconex Commons Lang 1.15.2 | |
Updated | New ICrawlDataStore#getProcessed(String) method. | |
Fixed | Fixed sometimes not "remembering" bad status and error when using "grace once". | #31 |
New | Added "unmanaged" attribute to "logsDir" configuration option to prevent the collector from managing its own file-based logging. | |
New | New "maxParallelCrawlers" collector configuration option. Limits the number of crawlers running at any given time, queuing the others. | #25 |
New | Added SSL support to MongoDB crawl data store. | |
New | New AbstractCollector#getState() method. | |
New | Added advanced configuration parameters to MVStoreCrawlDataStoreFactory. | |
Updated | Maven dependency updates: Norconex Commons Lang 1.15.1, Norconex Importer 2.10.0, Norconex Committer Core 2.1.3, Norconex JEF 4.1.2, H2 1.4.199. | |
Fixed | SpoiledReferenceStrategy of GRACE_ONCE now properly deletes a document on a subsequent failure. | |
Fixed | Add a retry to MongoDB upserts to fix getting a constraint violation on concurrent upserts. | #24 |
Updated | Significant performance improvement on MongoCrawlDataStore#isQueueEmpty(). | |
Updated | Dependency updates: Norconex Importer 2.9.0, Norconex Commons Lang 1.15.0. | |
Updated | AbstractCrawler now logs documents it could not process as INFO. | |
Updated | The MongoCrawlDataStore#buildMongoClient and #buildMongoCredentials methods were moved to MongoConnectionDetails. | #15 |
Fixed | Fixed embedded document checksum creation pulling the wrong cached checksum, causing embedded documents to always appear new when metadataChecksummer is disabled. | |
Fixed | Fixed showing wrong path in error message when command-line variable file is invalid. | #16 |
Fixed | Fixed NullPointerException under some conditions for AbstractCrawlerConfig#saveToXML(...). |
New | New "sourceFieldsRegex" option on GenericMetadataChecksummer and MD5DocumentChecksummer allowing the use of regular expressions to match the fields to use for building the checksum. | |
New | New "combineFieldsAndContent" option on MD5DocumentChecksummer to use both fields and content for building the checksum. | |
New | Can now specify custom collection names when using MongoCrawlDataStore and AbstractMongoCrawlDataStoreFactory implementations. | |
New | New "stopOnExceptions" option added to crawler configuration to force the crawler to stop upon encountering any of the specified exceptions. | |
Updated | The MongoCrawlDataStore now accepts references longer than 1024 characters. | |
Updated | AbstractCrawler no longer creates the work directory on object construction, but rather when the crawler starts. | |
Updated | Dependency updates: Norconex Importer 2.8.0, Norconex Commons Lang 1.14.0, Norconex Committer Core 2.1.2, Apache Commons DbUtils 1.7, MongoDB Java Driver 3.5.0, H2 Database 1.4.196. | |
Fixed | When the orphan strategy is "PROCESS", the crawler now always attempts to process a document, regardless of sitemap delays or recrawlable delays, since the reason for it becoming an orphan may be deletion, and we do not want to wait for a future crawl cycle to find out. |
Updated | Dependency updates: Norconex Importer 2.7.2. | |
Fixed | Fixed "caseSensitive" flag sometimes having no effect in RegexMetadataFilter and RegexReferenceFilter. |
New | MongoCrawlDataStore now supports specifying the MongoDB authentication mechanism to use (MONGODB-CR or SCRAM-SHA-1). | |
Updated | Classes related to MongoDB crawl store implementation were updated to use MongoDB 3.x API. | |
Updated | Dependency updates: Norconex Importer 2.7.1, Norconex Committer Core 2.1.1, MongoDB Driver 3.4.2, Fongo 2.0.13 (for tests). | |
Updated | AbstractCollector#saveToXML(...) output is now written with xml:space="preserve". | |
Fixed | Fixed "importer" config section not being inherited from "crawlerDefaults" when a specific crawler configuration does not declare one. |
New | Added schema-based XML configuration validation, which can be triggered from the command prompt with this new flag: -k or --checkcfg | |
New | New ICollectorLifeCycleListener interface that can be added on the collector configuration to be notified and take action when the collector starts and stops. | |
New | Two new crawler events were added for crawler event listeners: CRAWLER_STOPPING and CRAWLER_STOPPED. | |
New | AbstractMongoCrawlDataStoreFactory now accepts encrypted passwords. | |
New | Now distributed with utility scripts. | |
Updated | Crawler events REJECTED_FILTER, REJECTED_BAD_STATUS, REJECTED_IMPORT, and REJECTED_ERROR are now DEBUG in log4j.properties. | |
Updated | When their log level is DEBUG, the word "Subject:" has been removed from crawler event messages and "No additional information available." is shown when there is no extra info to show. | |
Updated | Dependency updates: Norconex Commons Lang 1.13.0, Norconex Importer 2.7.0, Norconex JEF API 4.1.0, Norconex Committer Core 2.1.0, JSoup 1.10.2. | |
Updated | Modified Javadoc to include an XML usage example for all XML-configurable classes. | |
Updated | ICrawlerConfig no longer implements Cloneable. | |
Updated | Document, metadata, and reference filters now log an appropriate message when there is no "include" match and the log level is DEBUG. | |
Fixed | Fixed crawler defaults not always being applied as they should. | |
Fixed | Fixed minor errors in writing IXMLConfigurable classes to XML. | |
Fixed | Throwable exceptions no longer make a crawler hang under certain conditions when importing/parsing a file. | |
Removed | Removed code deprecated in version 1.2 or older. | |
Removed | Removed MapDB and Apache Derby crawlstore dependencies/implementations which were deprecated in version 1.6. |
New | It is now possible to add JEF-related listeners on the collector configuration. | |
Updated | JMX support is now disabled by default to improve performance. It can be enabled by adding the JVM argument: -DenableJMX=true | |
Updated | Dependency updates: Norconex Commons Lang 1.12.3, Norconex Importer 2.6.1, Norconex JEF API 4.0.8, Joda Time 2.9.4, JJ2000 5.3, Apache HTTP Client 4.5.2, Apache HTTP Core 4.4.5, Apache Commons Logging 1.2 | |
Fixed | Fixed NullPointerException when stopping a crawler that did not previously run. |
New | New "checkcfg" launch action that will load a configuration without doing anything with it (to help resolve config issues). | |
New | New CrawlState#isSkipped() method to indicate if a document was unmodified or premature. | |
New | New AbstractCrawler#beforeFinalizeDocumentProcessing() method to let crawler implementations act on a document before it is finalized. | |
Updated | MVStoreCrawlDataStoreFactory is now the default crawl store factory (replacing now deprecated MapDB implementation). | |
Updated | Dependency updates: Norconex Importer 2.6.0, Norconex Committer Core 2.0.5, JSoup 1.9.2, Apache Commons DBCP 2.1.1, H2 Database 1.4.192. | |
Updated | API break: method signature changed for AbstractCrawler from applyCrawlData(ICrawlData crawlData, ImporterDocument document) to initCrawlData(ICrawlData crawlData, ICrawlData cachedCrawlData, ImporterDocument document). |
New | New BasicJDBCCrawlDataStoreFactory implementation for collector implementations with basic crawl storage needs. | |
New | New document crawl state: PREMATURE. | |
New | New crawler event: REJECTED_PREMATURE. | |
Updated | Default database implementation for AbstractJDBCDataStoreFactory when invoked with an empty constructor is now H2. | |
Updated | When provided by collectors, document "crawl date" and content type can be added to the crawl data and will be stored in the crawl data store (affects all ICrawlDataStoreFactory implementations). | |
Updated | Dependency updates: Norconex Importer 2.5.2, MapDB 1.0.9, H2 1.4.191, Fongo 1.6.2. | |
Updated | Event string value for DOCUMENT_COMMITTED_REMOVE changed from DOCUMENT_COMMITTED_REMOV to DOCUMENT_COMMITTED_REMOVE. |
Updated | Dependency updates: Norconex Importer 2.5.0. | |
Updated | ExtensionReferenceFilter is now smarter at detecting extension. | #2 |
Updated | ExtensionReferenceFilter now allows white spaces around extensions in XML config. |
Updated | Specifying an invalid path on the command-line for the config file or variable file now returns a meaningful message. | |
Updated | Maven direct dependency updates: Norconex Importer 2.4.0, Norconex JEF 4.0.7, Mongo Java Driver 2.13.3, Apache Derby 10.12.1.1. | |
Updated | Now logs (level INFO) a less alarming message when a module version cannot be found. | |
Updated | Now logs module version information in file. | |
Updated | A new metadata boolean field called "collector.is-crawl-new" is now added before document importing. It indicates whether the document is new to the crawler or already known from a previous run. | |
Updated | The cached instance of a reference's data is now passed around as opposed to being obtained from the reference cache each time it is needed. | |
Updated | Saved and loaded configuration-related classes are now equal. Methods equals/hashCode/toString for those classes are now implemented uniformly and were added where missing. | |
Fixed | Fixed some configuration classes not always being saved to XML properly or giving errors. | |
Fixed | Fixed IOException when "keepDownloads" is true. This was occurring for URLs with no path (just the host name). Now prefixes the created domain directory and domain file with "d." and "f." respectively. |
Updated | AbstractCrawler is no longer deleting remaining orphans after they have been processed (when orphan strategy is PROCESS). | |
Updated | Verbose logging in AbstractCrawler#processNextReference(...) has been changed from loglevel DEBUG to TRACE. | |
Updated | Dependency updates: Norconex Importer 2.3.1 and Norconex Committer Core 2.0.2. |
New | New configurable option: ISpoiledStateStategyResolver. It allows one to customize what strategy to adopt when a reference is in a bad crawl state (ignore, delete, or grace once). A default implementation is provided: GenericSpoiledStateStrategyResolver. | |
New | New GenericMetadataChecksummer for choosing one or many metadata fields and their values to create a checksum. | |
New | Now printing release versions of Norconex libraries used when a collector is launched. | |
New | New NOT_FOUND state constant added to CrawlState (migrated from the HTTP Collector). | |
Updated | AbstractCrawler is now firing REJECTED_ERROR events when an exception prevented proper processing of a reference. | |
Updated | Documents with a bad crawl state other than "NOT_FOUND" are now given one chance to recover before a deletion request gets sent. This can be overridden. | |
Updated | The OrphansStrategy default in crawler config is now PROCESS to get around cases where temporary conditions prevent access to documents that would normally be accessible (and that should not skip re-processing on incremental crawls). | |
Updated | MD5DocumentChecksummer#setField(String) has been deprecated in favor of MD5DocumentChecksummer#setFields(String...). | |
Updated | CrawlState#isCommittable() has been deprecated in favor of CrawlState#isNewOrModified(). | |
Updated | Setter method signatures accepting an array in AbstractCrawlerConfig were updated to accept "varargs" instead (variable arguments). | |
Updated | Uses default port when no Mongo port is specified when using Mongo data store. | |
Updated | When the saving of documents is enabled, each saved document is no longer printed to STDOUT but logged as a Log4j debug statement instead. | |
Updated | Regular expressions in RegexMetadataFilter and RegexReferenceFilter now always have the Pattern.DOTALL flag enabled and when case sensitivity is enabled for regex, Pattern.UNICODE_CASE is now always used. | |
Updated | Library updates: Norconex JEF 4.0.6, Norconex Importer 2.3.0, Norconex Commons Lang 1.6.2, Mongo Java Driver 2.13.2, H2 database 1.4.187. New dependency: JUnit 4.12 (test scope). | |
Updated | Jar manifest now includes implementation entries and specifications entries (matching Maven pom.xml). | |
Updated | Javadoc fixes and updates. | |
Fixed | Updated Mongo indexes to use stage instead of state (GitHub collector-http#97). | |
Fixed | Stopping a job that has been resumed now works as expected. | |
Removed | ICrawlDataStore#isVanished(ICrawlData) has been deprecated. |
New | New methods and configuration attribute to disable checksum creation in MD5DocumentChecksummer. | |
Updated | Library updates: Norconex Committer Core 2.0.1, Norconex Importer 2.1.1, Norconex JEF 4.0.4, MapDB 1.0.7, Apache Commons BeanUtils 1.9.2, Apache Commons DBCP2 2.1, Mongo Java Driver 2.13.0, H2 1.4.186. | |
Updated | Added Sonatype repository to pom.xml for snapshot releases. | |
Updated | Updated several Maven plugins and added the SonarQube Maven plugin. | |
Updated | Removed pom.xml dependency on Norconex Commons Lang, which is already provided by other dependencies. | |
Updated | Subject in event logging is now only shown on DEBUG log level. | |
Updated | The database XML configuration in AbstractJDBCDataStoreFactory is now case-insensitive. | |
Updated | H2 database now has a write delay of zero to ensure durability on JVM crash. | |
Updated | MapDB and MVStore implementations of ICrawlDataStore now force a commit on every addition, at the expense of performance, to ensure durability on JVM/OS/System crash. | |
Fixed | BaseCrawlData#setDocumentChecksum(String) is now deprecated in favor of BaseCrawlData#setContentChecksum(String) to fix content checksum not being saved in crawl data store properly. | |
Fixed | Fixed NullPointerException when running an incremental crawl over one that previously failed due to invalid configuration. | |
Fixed | Fixed incremental run not always handling non-modified documents properly (sometimes deleting, sometimes re-adding). | |
Fixed | Fixed NPE in AbstractJDBCDataStoreFactory#createCrawlDataStore(...) when database is null. |
Fixed | When splitting documents, crawlers will now trigger individual processing/deletion of children/embedded documents that no longer exist on incremental runs (based on your "orphansStrategy" configuration). When deleting orphans, deletion of a parent document will also trigger deletion requests to its children/embedded documents. | |
Fixed | Fixed an infinite loop that sometimes occurred when dealing with multiple threads and the configured maxDocument is reached (and greater than zero). This could prevent a collector from ever stopping. | |
Fixed | Fixed invalid detection of crawler execution state, affecting ability to stop a collector. | |
Fixed | Crawl data is no longer added to document metadata after the import phase (which could conflict with some handlers, like KeepOnlyTagger). | |
Updated | Default logging of Crawler events is now better aligned. | |
Updated | Updated JEF API to version 4.0.2. | |
Updated | Javadoc corrections. |
Fixed | When keepDownloads is true, saved files and directories are now prefixed with "f." and "d." respectively to avoid collisions. | |
Updated | Crawler id is now set on JEF JobSuite when a new thread starts to improve logging. | |
Updated | Upgraded norconex-jef to 4.0.1. |
New | Initial release. |
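
The regular-expression change noted above for RegexMetadataFilter and RegexReferenceFilter (Pattern.DOTALL always enabled, Pattern.UNICODE_CASE always used) can be illustrated with plain `java.util.regex`. This is only a minimal sketch, not the collector's actual code: the class and the `compile` helper are hypothetical names. Because `UNICODE_CASE` only has an effect together with `CASE_INSENSITIVE` in the JDK, the sketch assumes it is applied when matching is case-insensitive.

```java
import java.util.regex.Pattern;

public class RegexFlagsSketch {

    // Hypothetical helper (not part of the collector API): DOTALL is always
    // set, and UNICODE_CASE is added whenever matching is case-insensitive.
    static Pattern compile(String regex, boolean caseSensitive) {
        int flags = Pattern.DOTALL;
        if (!caseSensitive) {
            flags |= Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        }
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        // DOTALL lets "." match line terminators.
        System.out.println(compile("a.b", true).matcher("a\nb").matches());
        // UNICODE_CASE extends case-insensitive matching beyond ASCII letters.
        System.out.println(compile("\u00e9", false).matcher("\u00c9").matches());
        // Case-sensitive matching still distinguishes the two letters.
        System.out.println(compile("\u00e9", true).matcher("\u00c9").matches());
    }
}
```

With `CASE_INSENSITIVE` alone, the JDK compares only US-ASCII characters case-insensitively, which is why the accented-letter case depends on `UNICODE_CASE`.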