Norconex Web Crawler

What's new in version 3

Version 3 is a major release bringing a slew of new features. The following highlights some of the most important additions and changes. If you are upgrading from version 2.x, make sure to check the migration guide.

General Use

  • It is now possible to use your favorite browser (Chrome, Firefox, etc.) in "headless" mode to crawl more sophisticated pages (e.g., JavaScript-generated).
  • You can now configure multiple committers per crawler.
  • Variables in configuration files that cannot be resolved the traditional way are now resolved against system properties and environment variables (in that order).
  • You can now use partial class names in your configuration and the Collector will try to automatically detect the full name. E.g., com.norconex.collector.core.filter.impl.ExtensionReferenceFilter can be written simply as filter.impl.ExtensionReferenceFilter or even just ExtensionReferenceFilter.
  • New command line arguments to:
    • clean any remains of a previous crawl
    • import/export the crawl store
  • New ImageTransformer for performing simple operations on crawled images (convert format, resize, etc.).
  • New out-of-the-box "CSV" Committer.
  • New out-of-the-box LogCommitter and MemoryCommitter to facilitate testing.
  • In many places, you can now specify how metadata values are set/added.
  • Simplified event management with many new events.
  • Configured event listeners are automatically registered.
  • More...
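
To make two of the configuration items above more concrete (the fallback order for unresolved variables, and partial class names), here is a minimal Java sketch. It is an illustration under stated assumptions, not the Collector's actual code: the class ConfigResolutionSketch, its two methods, the "configured" map standing in for "the traditional way", and the explicit list of known class names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; names and signatures are NOT part of the Norconex API.
public final class ConfigResolutionSketch {

    private ConfigResolutionSketch() {}

    // Looks a variable up in the explicitly configured values first
    // (assumed stand-in for "the traditional way"), then falls back to
    // JVM system properties, then to environment variables.
    public static Optional<String> resolveVariable(
            String name, Map<String, String> configured) {
        if (name == null || name.isEmpty()) {
            return Optional.empty();
        }
        String value = configured.get(name);
        if (value == null) {
            value = System.getProperty(name);
        }
        if (value == null) {
            value = System.getenv(name);
        }
        return Optional.ofNullable(value);
    }

    // Returns every known fully qualified class name that equals the
    // partial name or ends with it on a package boundary ('.').
    // The candidate list is supplied by the caller here; how the real
    // Collector discovers candidates is not described in this document.
    public static List<String> matchClassNames(
            String partial, Collection<String> knownClassNames) {
        List<String> matches = new ArrayList<>();
        for (String fqcn : knownClassNames) {
            if (fqcn.equals(partial) || fqcn.endsWith("." + partial)) {
                matches.add(fqcn);
            }
        }
        return matches;
    }
}
```

In this sketch a partial name that matches more than one class yields several results, so the caller has to decide how to handle the ambiguity; the document does not say what the Collector does in that case.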

Java developers

  • Can use JAXB in addition to IXMLConfigurable to map configuration to objects.
  • New crawler events and improved listening options.
  • New IDataStoreEngine, accessible from the crawler, lets implementors store any kind of object between crawls in their own extensions.
  • Ability to use custom logging implementation (via SLF4J).
  • Can safely use Java 8 or higher (Java 8 is now the minimum required version).
  • The HTTP client has been abstracted, and you can now add your own custom implementation.
  • More...
Have a look at the release notes for more information.