If you are using this product for the first time, you can ignore this page and [get started] right away. If you are migrating from version 2.x, be aware of the following (API-breaking) changes.
| 2.x | 3.x | Notes |
| --- | --- | --- |
| Java 7 support | Java 8 or greater | Java 8 is now the minimum supported Java version. |
The working directory is now configured at the collector level, and all generated artifacts are derived from it unless otherwise stated. A sub-directory is created for the matching collector id, which in turn contains sub-directories for each crawler id defined.
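As a sketch (the element name, ids, and paths below are illustrative assumptions, not taken from this page), the working directory might be set once at the collector level, with per-collector and per-crawler sub-directories derived from it:

```xml
<!-- Sketch only: element name, ids, and paths are assumptions. -->
<httpcollector id="my-collector">
  <!-- All derived artifacts (progress files, data store, etc.)
       live under this directory. -->
  <workDir>./workdir</workDir>
  <crawlers>
    <crawler id="my-crawler">
      <!-- Artifacts for this crawler would end up under a path
           such as ./workdir/my-collector/my-crawler/ -->
    </crawler>
  </crawlers>
</httpcollector>
```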
It is now possible to have multiple collector/crawler configurations point to the same working directory without path collisions.
| 2.x | 3.x | Notes |
| --- | --- | --- |
| `<progressDir>` | None | The location of progress-related files is now derived from the collector working directory and is no longer a configurable element. |
**SLF4J:** Logging is now done with SLF4J and is no longer controlled by the Web Crawler. By default, logs are printed to STDOUT using Log4j2. You can configure Log4j yourself to write to a file instead, or use a different logging implementation altogether.
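For example, a minimal `log4j2.xml` on the classpath could redirect logs to a file instead of STDOUT (the file path and log levels below are illustrative assumptions):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal Log4j2 sketch writing to a file instead of STDOUT.
     File path and levels are illustrative assumptions. -->
<Configuration status="WARN">
  <Appenders>
    <File name="CrawlerLog" fileName="logs/crawler.log">
      <PatternLayout pattern="%d{ISO8601} %-5level [%logger{36}] %msg%n"/>
    </File>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="CrawlerLog"/>
    </Root>
  </Loggers>
</Configuration>
```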
The concepts of an HTTP "client" and a document/metadata fetcher have been merged into IHttpFetcher. It is possible to specify multiple HTTP fetchers for different types of URLs.
It is recommended to use GenericHttpFetcher for best performance.
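As an illustrative sketch (the exact element names and class package are assumptions and may differ in your version), a fetcher could be declared in the crawler configuration like this:

```xml
<!-- Sketch only: element names and the fetcher class path are assumptions. -->
<httpFetchers>
  <fetcher class="com.norconex.collector.http.fetch.impl.GenericHttpFetcher">
    <!-- Fetcher-specific settings (e.g., user agent) would go here.
         Additional <fetcher> entries can target other types of URLs. -->
  </fetcher>
</httpFetchers>
```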
Event management has been simplified. All event listeners now implement IEventListener and can be registered in multiple places.
Configuration classes, or any of their member variables, that implement IEventListener are automatically registered on startup. This allows developers to listen for any type of event from any configuration class.
JEF listeners, previously reserved for advanced use, no longer have their own configuration section. JEF events are propagated to all listeners interested in them.
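Besides the implicit registration described above, a listener class can also be declared explicitly in the XML configuration. The element name and the listener class below are assumptions for illustration only:

```xml
<!-- Sketch only: element name and listener class are assumptions. -->
<eventListeners>
  <!-- Any class implementing IEventListener can be listed here. -->
  <listener class="com.example.MyCrawlEventListener"/>
</eventListeners>
```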
As described earlier, listeners of all kinds were merged into a more generic listener implementation. Rather than dealing with many listener types, a richer set of events is fired.
| 2.x | 3.x | Notes |
| --- | --- | --- |
| `<userAgent>` | See notes | The user agent is now provided by IHttpFetcher implementations. It is up to each implementation to make the user agent configurable or not. |
**`<handler>`:** All handlers from the Importer module are now configured using the `<handler>` tag.
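A hedged sketch of what this looks like (the surrounding elements, the tagger class, and the field name are assumptions for illustration):

```xml
<!-- Sketch only: surrounding elements and the tagger class are assumptions. -->
<importer>
  <preParseHandlers>
    <!-- Every Importer handler now uses the generic <handler> tag,
         with its concrete type given by the class attribute. -->
    <handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
      <constant name="source">web</constant>
    </handler>
  </preParseHandlers>
</importer>
```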
IXMLConfigurable now deals with "XML" objects. Alternatively, XML configuration can now be achieved using JAXB.
The data "store" can now be used generically by implementors to store anything (not just URL-related crawl data).
There is no longer a need to create a factory, and the default implementation no longer creates a sitemap-specific database.
Wherever you could previously specify whether to overwrite existing values when setting new ones, you now use "onSet" instead, which provides the following options: append, prepend, replace, optional. The "onSet" option is also available in many more configurable classes.
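As a sketch of how this might look (the tagger class and the exact placement of the attribute are assumptions), "onSet" is typically given as an attribute on the configurable element:

```xml
<!-- Sketch only: class name and attribute placement are assumptions. -->
<handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger"
         onSet="replace">
  <!-- "replace" overwrites any existing value. Other options:
       append, prepend, optional (only set when no value exists). -->
  <constant name="source">web</constant>
</handler>
```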
**Various text-matching configuration options:** In most places where you could specify a regular expression, you can now choose which type of match and replace to use: basic, wildcard, regex, csv. You can also dictate whether a match must cover the entire value, and whether to ignore diacritical marks (e.g., accents) in addition to character case.
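A hedged sketch of such a matcher (the element name and attribute names below are assumptions and may differ in your version):

```xml
<!-- Sketch only: element and attribute names are assumptions. -->
<fieldMatcher method="wildcard"
              ignoreCase="true"
              ignoreDiacritic="true">
  titre*
</fieldMatcher>
```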
The ability to restrict Importer handler execution to certain
conditions has been much enhanced. Instead of using `restrictTo` within
a handler configuration, you can now use a simple XML-based
"flow" syntax outside of a handler to guide the crawler.
New tags were introduced: