If you are using this product for the first time, you can ignore this page and [get started] right away. If you are migrating from version 2.x, be aware of the following API-breaking changes.

Removed/Deprecated | Replacement/New | Notes |
---|---|---|
Java 7 support | Java 8 or greater | Java 8 is now the minimum supported Java version. |
<crawler>/<workDir> | <collector>/<workDir> | The working directory is now configured at the collector level, and all created artifacts are derived from it unless otherwise stated. A sub-directory is created for the matching collector id, which in turn contains a sub-directory for each crawler id defined. It is now possible to have multiple collector/crawler configurations point to the same working directory without path collisions. |
<progressDir> | None | The location of progress-related files is now derived from the collector working directory and is no longer a configurable element. |
<logsDir> Log4j | SLF4J | Logging is now done with SLF4J and is no longer controlled by the Web Crawler. By default, logs are now printed to STDOUT using Log4j2. You can configure Log4j yourself to store to file instead, or you can use a different logging implementation altogether. |
<metadataFetcher> <httpClientFactory> <documentFetcher> IHttpMetadataFetcher GenericMetadataFetcher IHttpClientFactory GenericHttpClientFactory IHttpDocumentFetcher GenericDocumentFetcher PhantomJSDocumentFetcher | <httpFetcher> IHttpFetcher GenericHttpFetcher WebDriverHttpFetcher | The concepts of an HTTP "client" and a document/metadata fetcher have been merged into IHttpFetcher. It is possible to specify multiple HTTP fetchers for different types of URLs. GenericHttpFetcher is recommended for best performance. WebDriverHttpFetcher allows the use of popular web browsers (e.g., Chrome, Firefox) to interpret JavaScript-driven web pages. |
<collectorListeners> <crawlerListeners> <jobLifeCycleListeners> <jobErrorListeners> <suiteLifeCycleListeners> ICollectorLifeCycleListener ICrawlerEventListener IJobLifeCycleListener IJobErrorListener ISuiteLifeCycleListener | <eventListeners> IEventListener | Event management has been simplified. All listeners now implement IEventListener and can be registered in multiple places. Configuration classes, or any of their member variables, that implement IEventListener are automatically registered on startup. This lets developers listen for any type of event from any configuration class. JEF listeners, previously reserved for advanced use, no longer have their own configuration section. JEF events are propagated to all listeners interested in them. |
CRAWLER_STARTED CRAWLER_RESUMED CRAWLER_FINISHED CRAWLER_STOPPING CRAWLER_STOPPED | COLLECTOR_RUN_BEGIN COLLECTOR_RUN_END COLLECTOR_STOP_BEGIN COLLECTOR_STOP_END COLLECTOR_CLEAN_BEGIN COLLECTOR_CLEAN_END COLLECTOR_STORE_EXPORT_BEGIN COLLECTOR_STORE_EXPORT_END COLLECTOR_STORE_IMPORT_BEGIN COLLECTOR_STORE_IMPORT_END COLLECTOR_ERROR CRAWLER_INIT_BEGIN CRAWLER_INIT_END CRAWLER_RUN_BEGIN CRAWLER_RUN_END CRAWLER_RUN_THREAD_BEGIN CRAWLER_RUN_THREAD_END CRAWLER_STOP_BEGIN CRAWLER_STOP_END CRAWLER_CLEAN_BEGIN CRAWLER_CLEAN_END IMPORTER_HANDLER_BEGIN IMPORTER_HANDLER_END IMPORTER_HANDLER_ERROR IMPORTER_HANDLER_CONDITION_TRUE IMPORTER_HANDLER_CONDITION_FALSE IMPORTER_PARSER_BEGIN IMPORTER_PARSER_END IMPORTER_PARSER_ERROR | Listeners of all kinds were merged into a single, more generic listener implementation. Instead of dealing with many listener types, you now deal with a larger set of events. |
<userAgent> | See notes | The user agent is now provided by IHttpFetcher implementations. It is up to each implementation to have the user-agent configurable or not. |
<filter> <tagger> <splitter> <transformer> | <handler> | All handlers from the Importer module are now configured using the <handler> tag. |
IXMLConfigurable: #loadFromXML(Reader) #saveToXML(Writer) | IXMLConfigurable: #loadFromXML(XML) #saveToXML(XML) | IXMLConfigurable now deals with "XML" objects. Alternatively, XML configuration can now be achieved using JAXB. |
<crawlDataStoreEngine> ICrawlDataStoreEngine | <dataStoreEngine> IDataStoreEngine | The data "store" can now be used generically for implementors to store anything (not just URL-related crawl data). |
ICrawlData | CrawlDocInfo | Renamed. |
<sitemapResolverFactory> ISitemapResolverFactory StandardSitemapResolverFactory StandardSitemapResolver SitemapURLAdder SitemapStore | <sitemapResolver> ISitemapResolver GenericSitemapResolver | You no longer need to create a factory, and the default implementation no longer creates a sitemap-specific database. |
overwrite="..." onConflict="..." ...#setOverwrite(boolean) ...#onConflict(OnConflict) | onSet="..." ...#onSet(PropertySetter) | Wherever you could specify whether to overwrite existing values or not when dealing with new values, you now use "onSet" instead, which provides the following options: append, prepend, replace, optional. The "onSet" option is also available in many more configurable classes. |
Various text-matching configuration options. | <textMatcher ignoreCase="..." ignoreDiacritic="..." partial="..." method="..."> (expression) </textMatcher> | In most places where you could specify a regular expression, you can now specify which type of match and replace to use: basic, wildcard, regex, csv. You can also dictate whether a match must cover the entire value, and whether to ignore diacritical marks (e.g., accents) in addition to character case. |
<restrictTo field="..." caseSensitive="..."> (regular expression) </restrictTo> | <if> <condition> ... </condition> <then> <handler .../> </then> <else> <handler .../> </else> </if> | The ability to restrict Importer handler execution to certain conditions has been much enhanced. Instead of using "restrictTo" within a handler configuration, you can now use a simple XML-based "flow" syntax outside of a handler to guide the crawler. New tags were introduced: if, ifNot, condition, then, else, reject. |
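
The "onSet" options named in the table (append, prepend, replace, optional) can be pictured with a small, self-contained Java sketch. It does not use the library's PropertySetter class: `OnSetSketch`, its `OnSet` enum, and the `set` method are hypothetical names, and the behavior shown for each option (in particular, "optional" storing values only when the field has none) is an assumption inferred from the option names, not a statement of the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; not the library's PropertySetter.
class OnSetSketch {

    // Assumed meaning of each option, inferred from its name.
    enum OnSet { APPEND, PREPEND, REPLACE, OPTIONAL }

    // Applies new values to a multi-valued field according to the chosen option.
    static void set(Map<String, List<String>> fields, String key,
            OnSet onSet, String... newValues) {
        List<String> incoming = Arrays.asList(newValues);
        List<String> existing =
                fields.computeIfAbsent(key, k -> new ArrayList<>());
        switch (onSet) {
            case APPEND:    // new values go after the existing ones
                existing.addAll(incoming);
                break;
            case PREPEND:   // new values go before the existing ones
                existing.addAll(0, incoming);
                break;
            case REPLACE:   // existing values are discarded
                existing.clear();
                existing.addAll(incoming);
                break;
            case OPTIONAL:  // new values are kept only if none exist yet
                if (existing.isEmpty()) {
                    existing.addAll(incoming);
                }
                break;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        set(fields, "title", OnSet.OPTIONAL, "A"); // stored: no value yet
        set(fields, "title", OnSet.OPTIONAL, "B"); // ignored: value exists
        set(fields, "title", OnSet.APPEND, "C");
        set(fields, "title", OnSet.PREPEND, "Z");
        System.out.println(fields); // prints {title=[Z, A, C]}
        set(fields, "title", OnSet.REPLACE, "only");
        System.out.println(fields); // prints {title=[only]}
    }
}
```

Under these assumptions, a 2.x setting of overwrite="true" would correspond to "replace", and overwrite="false" would most likely correspond to "append" or "optional", depending on whether existing values should be extended or left untouched.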