Norconex HTTP Collector

3.x Release Notes

Release History

Version      Date         Description
3.0.0-RC1    2021-10-10   Release Candidate 1.
3.0.0-M2     2021-07-28   Changes since last milestone for this upcoming major release.
3.0.0-M1     2021-03-01   Milestone towards next major release.

3.0.0-RC1 Release Candidate 1. Release date 2021-10-10

New Can now automatically stop a crawler when it reaches a certain number of crawler events, thanks to the new Collector Core "StopCrawlerOnMaxEventListener" class. #172
New Can now send deletion requests to committers based on "rejected" events thanks to the new Collector Core "DeleteRejectedEventListener" class. #211
New Added optional deduplication feature based on metadata and/or document checksums. #579
New Added support for ETag/If-None-Match to GenericHttpFetcher. #182
New Added support for HTTP Strict Transport Security (HSTS) to GenericHttpFetcher. Active by default. #694
New Added support for restricting which HTTP method GenericHttpFetcher accepts. #654
New Added "DISABLED", "OPTIONAL", "REQUIRED" options for HTTP HEAD and GET methods. #654
New New repeatable "extractSelector" and "noExtractSelector" options for DOMLinkExtractor.
Updated HttpCrawlerConfig metadata checksummer moved to base class CrawlerConfig.
Updated DOMLinkExtractor configurable "dom" elements deprecated in favor of "linkSelector".
Updated Checksummers "disabled" flag deprecated in favor of setting a null checksummer or using a self-closed checksummer tag in config.
Fixed Fixed invalid configuration in POM "maven-dependency-plugin".
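
The deduplication entry above (#579) mentions document checksums. As a rough illustration of the general idea only, and not the collector's actual implementation, the sketch below hashes document content and reports the first reference already seen with the same checksum. The class name ChecksumDeduplicator, its method names, and the choice of SHA-256 are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Norconex HTTP Collector.
final class ChecksumDeduplicator {

    // Maps a content checksum to the first reference (URL) seen with it.
    private final Map<String, String> firstSeen = new HashMap<>();

    // Returns the hex-encoded SHA-256 digest of the content (assumed algorithm).
    public static String checksum(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Returns the earlier reference holding identical content, if any.
    // The first reference seen for a given checksum is never a duplicate.
    public Optional<String> findDuplicateOf(String reference, String content) {
        String previous = firstSeen.putIfAbsent(checksum(content), reference);
        if (previous == null || previous.equals(reference)) {
            return Optional.empty();
        }
        return Optional.of(previous);
    }
}
```

A caller could skip or flag a document whenever findDuplicateOf returns a value. A metadata-based variant would hash selected header values instead of the content.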

3.0.0-M2 Changes since last milestone for this upcoming major release. Release date 2021-07-28

Removed Removed JEF dependency in favor of improved JMX for tracking.
Updated WebDriver HttpSniffer can now have its max buffer size configured. #751
Fixed Fixed NullPointerException when resolving sitemaps. #738
Fixed Fixed sitemap.xml URL entries not being extracted when they contain custom elements. #758
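
The notes do not describe how the sitemap fix (#758) was made. As a general illustration of reading sitemap entries that carry custom elements, the sketch below uses the JDK's namespace-aware DOM parser and collects only "loc" elements in the standard sitemap namespace, so elements from other namespaces are skipped. The class name SitemapLocSketch and its method are hypothetical and unrelated to the collector's own sitemap resolver.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Norconex HTTP Collector.
final class SitemapLocSketch {

    // Namespace defined by the sitemaps.org protocol.
    static final String SITEMAP_NS =
            "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Returns the text of every sitemap-namespace <loc>, in document order.
    public static List<String> extractLocs(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace awareness lets custom elements be told apart from <loc>.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList locs = doc.getElementsByTagNameNS(SITEMAP_NS, "loc");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < locs.getLength(); i++) {
                urls.add(locs.item(i).getTextContent().trim());
            }
            return urls;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse sitemap", e);
        }
    }
}
```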

3.0.0-M1 Milestone towards next major release. Release date 2021-03-01

New New IHttpFetcher for making HTTP requests. Multiple instances can now be specified and tried in sequence. This replaces IHttpClientFactory, IHttpDocumentFetcher, and IHttpMetadataFetcher.
New New WebDriverHttpFetcher for using popular browsers in headless mode (Chrome, Firefox, ...). Ideal for Javascript-driven websites and taking screenshots.
New Now supports If-Modified-Since for more efficient crawling. #637
New New flag for loading start URLs asynchronously.
New New HttpCollectorEvent.
New New crawler event: URLS_POST_IMPORTED (in addition to new events from Collector Core).
New New GenericHttpFetcher, replacing GenericHttpClientFactory and GenericHttpDocumentFetcher.
New New "disableSNI" crawler configuration option to disable Server Name Indication. #577
New New DOMLinkExtractor using JSoup to extract links.
New Link extractors can now extract links from metadata fields in addition to content.
New New "postImportLinks" configuration option for considering links from metadata fields created during import for crawling (providing a way to extract links from parsed binaries). #428
New New DOMLinkExtractor that uses JSoup to extract links from HTML/XML documents. #668
New New HttpDoc, HttpDocInfo, HttpDocMetadata (new or renamed).
New New metadata field set when URL changes from normalization: "collector.originalReference".
New Form authentication can now parse and submit HTML forms (taking login page URL instead of form action URL).
Updated PhantomJSDocumentFetcher now deprecated in favor of WebDriverHttpFetcher.
Updated Now using XML class from Norconex Commons Lang for loading/saving configuration.
Updated User-Agent no longer set directly on crawler config. It can be set on IHttpFetcher implementations that support it.
Updated Now using SLF4J for logging.
Updated HtmlLinkExtractor (and DOMLinkExtractor) now extracts all attributes of referrer links and adds them to the target document metadata. This can be disabled with the #setIgnoreLinkData() method. #668
Updated Dependency updates: Norconex Collector Core 2.0.0, Jetty 9.4.12.v20180830.
Updated Now requires Java 8 or higher.
Updated Lists now replace arrays in most places.
Updated Path is now used in addition to, or instead of, File in many places.
Updated Default working directory structure has been improved.
Updated Improved handling of "trustAllSSLCertificate" in GenericHttpFetcher by auto-accepting/storing certificates. #592
Updated Date handling now considers the time zone.
Updated Can now control whether to store as metadata all extracted links, just in-scope ones, out-of-scope, both, or none.
Updated Link extractors and canonical URL classes are now in their own distinct packages.
Updated GenericLinkExtractor renamed to HtmlLinkExtractor.
Updated Sitemap information now stored in the new data store engine as opposed to its own storage mechanism.
Updated URLStatusCrawlerEventListener now saves as CSV instead of TSV.
Updated HttpCrawlerEvent now only holds HTTP crawler event names. Crawler events are now all of CrawlerEvent type.
Removed Removed some code deprecated in releases before 3.0.0.
Removed IHttpClientFactory, IHttpDocumentFetcher, and IHttpMetadataFetcher and implementations were removed in favor of IHttpFetcher and GenericHttpFetcher.
Removed Removed "data" package (and its classes) in favor of new classes in "doc" package.
Removed JDBC crawl store implementation has been removed in favor of NoSQL only (MVStore + MongoDB).
Removed ISitemapResolverFactory removed in favor of ISitemapResolver.
Removed Removed HttpCollectorEvent (now relying on CollectorEvent only).
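
The If-Modified-Since support listed above (#637) relies on standard HTTP conditional requests: the client sends the time of its last successful fetch, and a 304 (Not Modified) response means the stored copy can be reused. The sketch below shows that exchange in plain JDK terms only. The class ConditionalRequestSketch and its methods are hypothetical and do not reflect how GenericHttpFetcher is implemented.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Norconex HTTP Collector.
final class ConditionalRequestSketch {

    // HTTP dates use a fixed-width, English, GMT-based format.
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    // Builds request headers for a re-crawl. With no previous crawl date,
    // no conditional header is added and a full response is expected.
    public static Map<String, String> conditionalHeaders(Instant lastCrawled) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (lastCrawled != null) {
            headers.put("If-Modified-Since", HTTP_DATE.format(lastCrawled));
        }
        return headers;
    }

    // A 304 status means the server considers the stored copy still current.
    public static boolean isUnmodified(int statusCode) {
        return statusCode == 304;
    }
}
```

A crawler following this pattern would skip re-importing a document when isUnmodified returns true, which is where the efficiency gain comes from.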