Norconex Web Crawler

3.x Release Notes

Release History

Version Date Description
3.1.0-SNAPSHOT 2024-??-?? Minor release.
3.0.2 2023-07-09 Maintenance release.
3.0.1 2022-08-30 Maintenance release.
3.0.0 2022-01-05 Major release. NOT a drop-in replacement for 2.x.

3.1.0-SNAPSHOT Minor release. Release date 2024-??-??

This release is currently in development and the following information may change.
New New "removeTrailingFragment" option on GenericURLNormalizer. #1061
New New WebDriverHttpFetcher browser options: CUSTOM.
New Can now configure HttpSniffer "host" with WebDriverHttpFetcher.
New Now supports chained proxy when using HttpSniffer.
New Can now pass arguments to WebDriverHttpFetcher browser (in addition to regular WebDriver capabilities). #844
Updated Now requires Java 11 to run.
New Can now specify more than one URL Normalizer. #1063
Updated Maven dependency updates: Guava 33.0.0, imgscalr 4.2, Jetty 9.4.54.v20240208, Netty 4.1.96.Final, Selenium 4.23.0, Apache Commons Lang 3.14.0.
Fixed Fixed stop command causing error. #1067
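
As a rough illustration of what specifying more than one URL normalizer can mean in practice, the sketch below applies a list of normalization steps in order, each one receiving the output of the previous one. It is not the crawler's implementation: the class name NormalizerChainSketch, the methods lowerCaseSchemeHost, dropFragment, and normalize, and the two rules themselves are hypothetical and chosen only for the example. In particular, dropFragment is not meant to reproduce the exact behavior of "removeTrailingFragment".

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class NormalizerChainSketch {

    // Hypothetical step: lower-case the scheme and host only.
    // Simplification: assumes the URL has no user-info part.
    static String lowerCaseSchemeHost(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url; // not an absolute URL, leave untouched
        }
        // The authority ends at the first '/', '?' or '#' after "://".
        int end = url.length();
        for (int i = schemeEnd + 3; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                end = i;
                break;
            }
        }
        return url.substring(0, end).toLowerCase(Locale.ROOT)
                + url.substring(end);
    }

    // Hypothetical step: drop everything from the first '#' onward.
    static String dropFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Applies each step in order, feeding each result to the next step.
    static String normalize(String url, List<UnaryOperator<String>> steps) {
        String result = url;
        for (UnaryOperator<String> step : steps) {
            result = step.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> chain = List.of(
                NormalizerChainSketch::lowerCaseSchemeHost,
                NormalizerChainSketch::dropFragment);
        // prints http://example.com/Docs/Page.html
        System.out.println(
                normalize("HTTP://Example.COM/Docs/Page.html#intro", chain));
    }
}
```

Order matters in a chain like this: a later step sees only what earlier steps left behind, so steps that depend on a URL part should run before any step that removes it.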

3.0.2 Maintenance release. Release date 2023-07-09

Fixed Fixed GenericSitemapResolver NullPointerException when the sitemap content-type could not be detected. #803
Updated Maven dependency updates: norconex-commons-maven-parent 1.0.2, norconex-collector-core 2.0.2, norconex-importer 3.0.1, Guava 32.0.0-jre, Selenium 4.0.0, Jetty 9.4.51.v20230217.

3.0.1 Maintenance release. Release date 2022-08-30

New New MDC attributes that can be used in supporting logging frameworks: "ctx:crawler.id", "ctx:crawler.id.safe", "ctx:collector.id", and "ctx:collector.id.safe". #790
Fixed Fixed invalid resolution of relative URLs that contain a colon (:) that is not part of the scheme. #788
Fixed Fixed the effective top-level domain not always being considered properly in HSTS resolution. #785
Fixed Fixed occasional concurrency issue when crawler terminates. #781
Fixed Fixed the crawler sometimes not exiting when done.

3.0.0 Major release. NOT a drop-in replacement for 2.x. Release date 2022-01-05

Updated Updated transitive dependencies with known vulnerabilities.
Fixed Fixed data store engine errors (via "collector-core" dependency update). #766
Fixed Fixed "Connection Pool Shut Down" error (from collector-core update). #770
New Can now automatically stop a crawler when it reaches a certain number of crawler events, thanks to the new Collector Core "StopCrawlerOnMaxEventListener" class. #172
New Can now send deletion requests to Committers based on "rejected" events thanks to the new Collector Core "DeleteRejectedEventListener" class. #211
New Added optional deduplication feature based on metadata and/or document checksums. #579
New Added support for ETag/If-None-Match to GenericHttpFetcher. #182
New Added support for HTTP Strict Transport Security (HSTS) to GenericHttpFetcher. Active by default. #694
New Added support for restricting which HTTP method GenericHttpFetcher accepts. #654
New Added "DISABLED", "OPTIONAL", "REQUIRED" options for HTTP HEAD and GET methods. #654
New New DOMLinkExtractor repeatable "extractSelector" and "noExtractSelector".
Updated HttpCrawlerConfig metadata checksummer moved to base class CrawlerConfig.
Updated DOMLinkExtractor configurable "dom" elements deprecated in favor of "linkSelector".
Updated Checksummers "disabled" flag deprecated in favor of setting a null checksummer or using a self-closed checksummer tag in config.
Fixed Fixed invalid configuration in POM "maven-dependency-plugin".
Removed Removed JEF dependency in favor of improved JMX for tracking.
Updated WebDriver HttpSniffer can now have its max buffer size configured. #751
Fixed Fixed NullPointerException when resolving sitemaps. #738
Fixed Fixed sitemap.xml URL entries not being extracted when they contain custom elements. #758
New New IHttpFetcher for making HTTP requests. Multiple instances can now be specified and tried in sequence. This replaces IHttpClientFactory, IHttpDocumentFetcher, and IHttpMetadataFetcher.
New New WebDriverHttpFetcher for using popular browsers in headless mode (Chrome, Firefox, ...). Ideal for JavaScript-driven websites and taking screenshots.
New Now supports If-Modified-Since for more efficient crawling. #637
New New flag for loading start URLs asynchronously.
New New HttpCollectorEvent.
New New crawler event: URLS_POST_IMPORTED (in addition to new events from Collector Core).
New New GenericHttpFetcher, replacing GenericHttpClientFactory and GenericHttpDocumentFetcher.
New New "disableSNI" crawler configuration option to disable Server Name Indication. #577
New New DOMLinkExtractor using JSoup to extract links.
New Link extractors can now extract links from metadata fields in addition to content.
New New "postImportLinks" configuration option for considering links from metadata fields created during import for crawling (providing a way to extract links from parsed binaries). #428
New New DOMLinkExtractor that uses JSoup to extract links from HTML/XML documents. #668
New New HttpDoc, HttpDocInfo, HttpDocMetadata (new or renamed).
New New metadata field set when URL changes from normalization: "collector.originalReference".
New Form authentication can now parse and submit HTML forms (taking login page URL instead of form action URL).
Updated PhantomJSDocumentFetcher now deprecated in favor of WebDriverHttpFetcher.
Updated Now using XML class from Norconex Commons Lang for loading/saving configuration.
Updated User-Agent no longer set directly on crawler config. It can be set on IHttpFetcher implementations that support it.
Updated Now using SLF4J for logging.
Updated HtmlLinkExtractor (and DOMLinkExtractor) now extracts all attributes of referrer links and adds them to the target document metadata. This can be disabled with the #setIgnoreLinkData() method. #668
Updated Dependency updates: Norconex Collector Core 2.0.0, Jetty 9.4.12.v20180830.
Updated Now requires Java 8 or higher.
Updated Lists are now replacing arrays in most places.
Updated Path is now used in addition to, or instead of, File in many places.
Updated Default working directory structure has been improved.
Updated Improved handling of "trustAllSSLCertificate" in GenericHttpFetcher by auto-accepting/storing certificates. #592
Updated Date handling now considers the time zone.
Updated Can now control whether to store as metadata all extracted links, just in-scope ones, out-of-scope, both, or none.
Updated Link extractors and canonical URL classes are now in their own distinct packages.
Updated GenericLinkExtractor renamed to HtmlLinkExtractor.
Updated Sitemap information now stored in the new data store engine as opposed to its own storage mechanism.
Updated URLStatusCrawlerEventListener now saves as CSV instead of TSV.
Updated HttpCrawlerEvent now only holds HTTP crawler event names. Crawler events are now all of CrawlerEvent type.
Removed Removed some code deprecated in releases before 3.0.0.
Removed IHttpClientFactory, IHttpDocumentFetcher, and IHttpMetadataFetcher and implementations were removed in favor of IHttpFetcher and GenericHttpFetcher.
Removed Removed "data" package (and its classes) in favor of new classes in "doc" package.
Removed JDBC crawl store implementation has been removed in favor of NoSQL only (MVStore + MongoDB).
Removed ISitemapResolverFactory removed in favor of ISitemapResolver.
Removed Removed HttpCollectorEvent (now relying on CollectorEvent only).
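
The ETag/If-None-Match and If-Modified-Since entries above both rely on standard HTTP conditional requests: the client sends back a validator it stored from an earlier response, and the server answers 304 Not Modified when the resource has not changed. The sketch below shows the general idea using the JDK's own java.net.http client (Java 11 or higher). It is an illustration only and does not show how GenericHttpFetcher is implemented. The class name ConditionalRequestSketch, the methods conditionalGet and isUnmodified, and the example URL and validator values are all hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ConditionalRequestSketch {

    // Builds a GET request carrying the validators remembered from a
    // previous fetch. Either validator may be null when it is unknown.
    static HttpRequest conditionalGet(
            String url, String cachedETag, String cachedLastModified) {
        HttpRequest.Builder builder =
                HttpRequest.newBuilder(URI.create(url)).GET();
        if (cachedETag != null) {
            builder.header("If-None-Match", cachedETag);
        }
        if (cachedLastModified != null) {
            builder.header("If-Modified-Since", cachedLastModified);
        }
        return builder.build();
    }

    // 304 Not Modified means the previously fetched copy is still current,
    // so the document does not need to be downloaded or processed again.
    static boolean isUnmodified(int statusCode) {
        return statusCode == 304;
    }

    public static void main(String[] args) {
        // Building the request does not open a network connection.
        HttpRequest request = conditionalGet(
                "https://example.com/page.html",
                "\"abc123\"",
                "Wed, 05 Jan 2022 00:00:00 GMT");
        System.out.println(request.headers().map());
    }
}
```

A caller would send the request with an HttpClient, skip the document when isUnmodified returns true for the response status, and otherwise store the new ETag and Last-Modified response headers for the next crawl.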