Norconex Web Crawler
What's new in version 3
Version 3 is a major release bringing a slew of new features.
The following highlights some of the most important additions and changes.
If you are upgrading from version 2.x, make sure to check this
migration guide.
General Use
- It is now possible to use your favorite browser (Chrome, Firefox, etc.)
in "headless" mode to crawl more sophisticated pages
(e.g., JavaScript-generated).
- You can now configure multiple committers per crawler.
- Variables in configuration files that cannot be resolved the traditional
way are now resolved against system properties and environment variables
(in that order).
- You can now use partial class names in your configuration
and the Collector will try to detect the full name automatically. E.g.,
com.norconex.collector.core.filter.impl.ExtensionReferenceFilter
can be written simply as filter.impl.ExtensionReferenceFilter
or even just ExtensionReferenceFilter.
- New command line arguments to:
- clean any remains of a previous crawl
- import/export the crawl store
- New ImageTransformer for performing simple operations on crawled
images (convert format, resize, etc.).
- New out-of-the-box "CSV" Committer.
- New out-of-the-box LogCommitter and MemoryCommitter to facilitate
testing.
- In many places, you can now specify how metadata values are set/added.
- Simplified event management with many new events.
- Configured event listeners are automatically registered.
- More...
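Two of the behaviors listed above, the variable fallback order and partial class name matching, can be illustrated with a small plain-Java sketch. This is not the Collector's actual implementation: the class and method names are hypothetical, the "traditional way" of resolving variables is modeled as a simple map, the list of known classes is supplied by the caller rather than discovered on the classpath, and the variable name crawler.workdir is made up for the demonstration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: every name in this class is hypothetical and
// none of it is part of the Norconex API.
class ConfigResolutionSketch {

    // Looks a variable up in the explicitly defined variables first, then
    // in JVM system properties, then in environment variables.
    // Returns null when the variable is not defined anywhere.
    static String resolveVariable(String name, Map<String, String> explicitVars) {
        String value = explicitVars.get(name);
        if (value == null) {
            value = System.getProperty(name);
        }
        if (value == null) {
            value = System.getenv(name);
        }
        return value;
    }

    // Returns every known fully qualified class name that a (possibly
    // partial) name could refer to: either an exact match, or a match on
    // the trailing package/class segments.
    static List<String> matchClassNames(String partial, List<String> knownClasses) {
        List<String> matches = new ArrayList<>();
        for (String fqcn : knownClasses) {
            if (fqcn.equals(partial) || fqcn.endsWith("." + partial)) {
                matches.add(fqcn);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // No explicit variable is defined, so the lookup falls back to the
        // system property set here (hypothetical variable name).
        System.setProperty("crawler.workdir", "/tmp/crawl");
        Map<String, String> explicitVars = new HashMap<>();
        System.out.println(resolveVariable("crawler.workdir", explicitVars));

        List<String> known = Arrays.asList(
                "com.norconex.collector.core.filter.impl.ExtensionReferenceFilter",
                "com.example.MyCustomFilter"); // hypothetical class name
        System.out.println(matchClassNames("ExtensionReferenceFilter", known));
    }
}
```

Matching on whole trailing segments means that ExtensionReferenceFilter and filter.impl.ExtensionReferenceFilter both select the same class, while a fragment such as ReferenceFilter selects nothing. How the Collector handles a partial name that matches several classes is not stated here, so the sketch simply returns all candidates.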
Java developers
- Can use JAXB in addition to IXMLConfigurable to map configuration
to objects.
- New crawler events and improved listening options.
- New IDataStoreEngine, accessible from the crawler, that implementors can
use in their own extensions to store any kind of object between crawls.
- Ability to use custom logging implementation (via SLF4J).
- Can safely use Java 8 or higher (Java 8 is now the minimum Java
version required).
- The HTTP client has been abstracted and you can now add your custom
implementation.
- More...
Have a look at the release notes for more information.