Norconex Web Crawler

Migrate to version 3

If you are using this product for the first time, you can ignore this page and [get started] right away. If you are migrating from version 2.x, be aware of the following (API-breaking) changes.

Changes

Each change below lists the removed or deprecated element(s), the replacement or new element(s), and explanatory notes.

Removed/Deprecated:
  Java 7 support
Replacement/New:
  Java 8 or greater
Notes:
  Java 8 is now the minimum supported Java version.

Removed/Deprecated:
  <crawler>/<workDir>
Replacement/New:
  <collector>/<workDir>
Notes:
  The working directory is now configured at the collector level, and all created artifacts are derived from it unless otherwise stated. A sub-directory is created for the matching collector id, which in turn contains sub-directories for each crawler id defined.

  It is now possible to have multiple collector/crawler configurations point to the same working directory without path collisions.
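
As an illustration, a minimal sketch of the new placement (root tag, ids, and paths are placeholders; consult the reference configuration for exact syntax):

```xml
<!-- Sketch only: <workDir> now belongs to the collector, not the crawler. -->
<httpcollector id="my-collector">
  <workDir>./workdir</workDir>
  <crawlers>
    <crawler id="my-crawler">
      <!-- Artifacts end up under ./workdir/my-collector/my-crawler/ -->
    </crawler>
  </crawlers>
</httpcollector>
```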

Removed/Deprecated:
  <progressDir>
Replacement/New:
  None
Notes:
  The location of progress-related files is now derived from the collector working directory and is no longer a configurable element.

Removed/Deprecated:
  <logsDir>
  Log4j
Replacement/New:
  SLF4J
Notes:
  Logging is now done with SLF4J and is no longer controlled by the Web Crawler. By default, logs are printed to STDOUT using Log4j2. You can configure Log4j2 yourself to store logs to a file instead, or use a different logging implementation altogether.

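
For example, a minimal Log4j2 configuration writing logs to a file instead of STDOUT could look like the sketch below (file name and pattern are illustrative):

```xml
<!-- Illustrative log4j2.xml: sends all logs to a file instead of STDOUT. -->
<Configuration status="warn">
  <Appenders>
    <File name="LogFile" fileName="./logs/crawler.log">
      <PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
    </File>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="LogFile"/>
    </Root>
  </Loggers>
</Configuration>
```
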
Removed/Deprecated:
  <metadataFetcher>
  <httpClientFactory>
  <documentFetcher>
  IHttpMetadataFetcher
  GenericMetadataFetcher
  IHttpClientFactory
  GenericHttpClientFactory
  IHttpDocumentFetcher
  GenericDocumentFetcher
  PhantomJSDocumentFetcher
Replacement/New:
  <httpFetcher>
  IHttpFetcher
  GenericHttpFetcher
  WebDriverHttpFetcher
Notes:
  The concepts of an HTTP "client" and a document/metadata fetcher have been merged into IHttpFetcher. It is possible to specify multiple HTTP fetchers for different types of URLs.

  It is recommended to use GenericHttpFetcher for best performance.

  WebDriverHttpFetcher allows the use of popular web browsers (e.g., Chrome, Firefox) to interpret JavaScript-driven web pages.
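
For illustration, declaring more than one fetcher might look like the sketch below (the wrapper tag, nesting, and package names are assumptions; verify against the v3 reference configuration):

```xml
<!-- Sketch only: two fetchers, tried for matching URLs. -->
<httpFetchers>
  <fetcher class="com.norconex.collector.http.fetch.impl.GenericHttpFetcher"/>
  <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
    <browser>firefox</browser>
  </fetcher>
</httpFetchers>
```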

Removed/Deprecated:
  <collectorListeners>
  <crawlerListeners>
  <jobLifeCycleListeners>
  <jobErrorListeners>
  <suiteLifeCycleListeners>
  ICollectorLifeCycleListener
  ICrawlerEventListener
  IJobLifeCycleListener
  IJobErrorListener
  ISuiteLifeCycleListener
Replacement/New:
  <eventListeners>
  IEventListener
Notes:
  Event management has been simplified. All event listeners now implement IEventListener and can be registered in multiple places.

  Configuration classes, or any of their member variables, implementing IEventListener are automatically registered on startup. This allows developers to listen for any type of event from any configuration class.

  JEF listeners, previously reserved for advanced use, no longer have their own configuration section. JEF events are propagated to all listeners interested in them.
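
For illustration, registering a custom listener in XML might look like this sketch (the child tag name and the listener class are assumptions, not taken from the reference documentation):

```xml
<!-- Sketch only: com.example.MyEventListener is a hypothetical IEventListener. -->
<eventListeners>
  <listener class="com.example.MyEventListener"/>
</eventListeners>
```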

Removed/Deprecated:
  CRAWLER_STARTED
  CRAWLER_RESUMED
  CRAWLER_FINISHED
  CRAWLER_STOPPING
  CRAWLER_STOPPED
Replacement/New:
  COLLECTOR_RUN_BEGIN
  COLLECTOR_RUN_END
  COLLECTOR_STOP_BEGIN
  COLLECTOR_STOP_END
  COLLECTOR_CLEAN_BEGIN
  COLLECTOR_CLEAN_END
  COLLECTOR_STORE_EXPORT_BEGIN
  COLLECTOR_STORE_EXPORT_END
  COLLECTOR_STORE_IMPORT_BEGIN
  COLLECTOR_STORE_IMPORT_END
  COLLECTOR_ERROR
  CRAWLER_INIT_BEGIN
  CRAWLER_INIT_END
  CRAWLER_RUN_BEGIN
  CRAWLER_RUN_END
  CRAWLER_RUN_THREAD_BEGIN
  CRAWLER_RUN_THREAD_END
  CRAWLER_STOP_BEGIN
  CRAWLER_STOP_END
  CRAWLER_CLEAN_BEGIN
  CRAWLER_CLEAN_END
  IMPORTER_HANDLER_BEGIN
  IMPORTER_HANDLER_END
  IMPORTER_HANDLER_ERROR
  IMPORTER_HANDLER_CONDITION_TRUE
  IMPORTER_HANDLER_CONDITION_FALSE
  IMPORTER_PARSER_BEGIN
  IMPORTER_PARSER_END
  IMPORTER_PARSER_ERROR
Notes:
  As described earlier, listeners of all kinds were merged into a more generic listener implementation. Instead of dealing with many listeners, more events are fired.

Removed/Deprecated:
  <userAgent>
Replacement/New:
  See notes
Notes:
  The user agent is now provided by IHttpFetcher implementations. It is up to each implementation to make the user agent configurable or not.

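
For instance, GenericHttpFetcher can carry the user agent in its own configuration; a sketch only (tag placement is assumed):

```xml
<!-- Sketch only: user agent set on the fetcher rather than globally. -->
<httpFetcher class="com.norconex.collector.http.fetch.impl.GenericHttpFetcher">
  <userAgent>MyCrawler/3.0 (+https://example.com/bot)</userAgent>
</httpFetcher>
```
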
Removed/Deprecated:
  <filter>
  <tagger>
  <splitter>
  <transformer>
Replacement/New:
  <handler>
Notes:
  All handlers from the Importer module are now configured using the <handler> tag.

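
A sketch of the new form (the handler class and its surrounding placement are illustrative assumptions):

```xml
<!-- Sketch only: a tagger configured through the generic <handler> tag. -->
<importer>
  <preParseHandlers>
    <handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
      <constant name="source">web-crawl</constant>
    </handler>
  </preParseHandlers>
</importer>
```
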
Removed/Deprecated:
  IXMLConfigurable:
    #loadFromXML(Reader)
    #saveToXML(Writer)
Replacement/New:
  IXMLConfigurable:
    #loadFromXML(XML)
    #saveToXML(XML)
Notes:
  IXMLConfigurable now deals with "XML" objects. Alternatively, XML configuration can now be achieved using JAXB.

Removed/Deprecated:
  <crawlDataStoreEngine>
  ICrawlDataStoreEngine
Replacement/New:
  <dataStoreEngine>
  IDataStoreEngine
Notes:
  The data "store" can now be used generically by implementors to store anything (not just URL-related crawl data).

Removed/Deprecated:
  ICrawlData
Replacement/New:
  CrawlDocInfo
Notes:
  Renamed.

Removed/Deprecated:
  <sitemapResolverFactory>
  ISitemapResolverFactory
  StandardSitemapResolverFactory
  StandardSitemapResolver
  SitemapURLAdder
  SitemapStore
Replacement/New:
  <sitemapResolver>
  ISitemapResolver
  GenericSitemapResolver
Notes:
  There is no longer a need to create a factory, and the default implementation no longer creates a sitemap-specific database.

Removed/Deprecated:
  overwrite="..."
  onConflict="..."
  ...#setOverwrite(boolean)
  ...#onConflict(OnConflict)
Replacement/New:
  onSet="..."
  ...#onSet(PropertySetter)
Notes:
  Wherever you could previously specify whether to overwrite existing values when setting new ones, you now use "onSet" instead, which provides the following options: append, prepend, replace, optional. The "onSet" option is also available in many more configurable classes.

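
For illustration, using onSet on a tagger could look like the sketch below (the handler class and the exact attribute placement are assumptions):

```xml
<!-- Sketch only: onSet="replace" overwrites any existing "lang" value. -->
<handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
  <constant name="lang" onSet="replace">en</constant>
</handler>
```
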
Removed/Deprecated:
  Various text-matching configuration options.
Replacement/New:
  <textMatcher
      ignoreCase="..."
      ignoreDiacritic="..."
      partial="..."
      method="...">
    (expression)
  </textMatcher>
Notes:
  In most places where you could specify a regular expression, you can now choose which type of matching and replacing to use: basic, wildcard, regex, csv. You can also dictate whether a match must cover the entire value, and whether to ignore diacritical marks (e.g., accents) in addition to character case.

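
For example, a partial wildcard match ignoring character case might be expressed as follows (surrounding context omitted; a sketch only):

```xml
<!-- Sketch only: matches values containing a ".example.com" host, case-insensitively. -->
<textMatcher method="wildcard" ignoreCase="true" partial="true">
  *.example.com
</textMatcher>
```
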
Removed/Deprecated:
  <restrictTo
      field="..."
      caseSensitive="...">
    (regular expression)
  </restrictTo>
Replacement/New:
  <if>
    <condition>
      ...
    </condition>
    <then>
      <handler .../>
    </then>
    <else>
      <handler .../>
    </else>
  </if>
Notes:
  The ability to restrict Importer handler execution to certain conditions has been greatly enhanced. Instead of using "restrictTo" within a handler configuration, you can now use a simple XML-based "flow" syntax outside of a handler to guide the crawler. New tags were introduced: if, ifNot, condition, then, else, reject.
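
For illustration, a flow that applies a handler only when a condition holds, rejecting the document otherwise (both class names below are hypothetical placeholders, not real Norconex classes):

```xml
<!-- Sketch only: condition and handler classes are made-up examples. -->
<if>
  <condition class="com.example.IsPdfCondition"/>
  <then>
    <handler class="com.example.PdfMetadataTagger"/>
  </then>
  <else>
    <reject/>
  </else>
</if>
```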