Norconex Web Crawler

3.x Release Notes

Release History

Version	Date	Description
3.1.0	2025-05-24	Minor release.
3.0.2	2023-07-09	Maintenance release.
3.0.1	2022-08-30	Maintenance release.
3.0.0	2022-01-05	Major release. NOT a drop-in replacement for 2.x.

3.1.0 Minor release. Release date 2025-05-24

New	HTTP status code and reason are now captured as metadata fields "collector.http-status-code" and "collector.http-status-reason", respectively. Applies to both GenericHttpFetcher and WebDriverHttpFetcher.	#1120
New	New "removeTrailingFragment" option on GenericURLNormalizer.	#1061
New	New WebDriverHttpFetcher browser option: CUSTOM.
New	Can now configure HttpSniffer "host" with WebDriverHttpFetcher.
New	Can now configure HttpSniffer "responseTimeout" with WebDriverHttpFetcher.
New	Now supports chained proxy when using HttpSniffer.
New	Can now pass arguments to WebDriverHttpFetcher browser (in addition to regular WebDriver capabilities).	#844
Updated	Now requires Java 11 to run.
New	Can now specify more than one URL Normalizer.	#1063
Updated	HttpSniffer now uses a more recent LittleProxy fork (LittleProxy/LittleProxy), replacing the BrowserMob Proxy and original LittleProxy dependencies.	#1122
Updated	Maven dependency updates: Guava 33.0.0, imgscalr 4.2, Jetty 9.4.54.v20240208, Netty 4.1.96.Final, Selenium 4.23.0, Apache Commons Lang 3.14.0, LittleProxy 2.4.0. Removed browsermob.
Fixed	Fixed redirected URL target being considered orphan when source URL is not yet ready for recrawl.	#1121
Fixed	Fixed a NullPointerException in WebDriverHttpFetcher occurring in some init tasks.	#1117
Fixed	Fixed stop command causing error.	#1067
Fixed	Fixed Opera WebDriverHttpFetcher proxy settings not being applied.
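
To illustrate what specifying more than one URL normalizer (#1063) means in practice, here is a minimal sketch of normalizers applied in sequence, each receiving the output of the previous one. It is not the Norconex API: the class NormalizerChainSketch and its methods are hypothetical, and the two normalization steps are simplified stand-ins chosen for the example.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical illustration only; not the Norconex URL normalizer API. */
final class NormalizerChainSketch {

    /** Drops everything from the first '#' onward, if a fragment is present. */
    static String dropFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    /** Lowercases everything before the first path slash (scheme and authority). */
    static String lowerCaseSchemeAndHost(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url; // not an absolute URL, leave untouched
        }
        int pathStart = url.indexOf('/', schemeEnd + 3);
        if (pathStart < 0) {
            pathStart = url.length();
        }
        return url.substring(0, pathStart).toLowerCase(Locale.ROOT)
                + url.substring(pathStart);
    }

    /** Applies each normalizer in order, feeding each result to the next one. */
    static String normalize(String url, List<UnaryOperator<String>> chain) {
        String result = url;
        for (UnaryOperator<String> step : chain) {
            result = step.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> chain = List.of(
                NormalizerChainSketch::dropFragment,
                NormalizerChainSketch::lowerCaseSchemeAndHost);
        System.out.println(
                normalize("HTTP://Example.COM/Docs/Page.html#top", chain));
    }
}
```

Order matters in such a chain: a later normalizer only sees what earlier ones produced.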

3.0.2 Maintenance release. Release date 2023-07-09

Fixed	Fixed GenericSitemapResolver NPE when the sitemap content-type could not be detected.	#803
Updated	Maven dependency updates: norconex-commons-maven-parent 1.0.2, norconex-collector-core 2.0.2, norconex-importer 3.0.1, Guava 32.0.0-jre, Selenium 4.0.0, Jetty 9.4.51.v20230217.

3.0.1 Maintenance release. Release date 2022-08-30

New	New MDC attributes that can be used in logging frameworks supporting them: "ctx:crawler.id", "ctx:crawler.id.safe", "ctx:collector.id", and "ctx:collector.id.safe".	#790
Fixed	Fixed invalid resolution of relative URLs that contain a colon (:) that is not part of the scheme.	#788
Fixed	Fixed the effective top-level domain not always being considered properly in HSTS resolution.	#785
Fixed	Fixed occasional concurrency issue when crawler terminates.	#781
Fixed	Fixed the crawler sometimes not exiting when done.
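
The relative URL fix (#788) concerns a known ambiguity: a colon early in a relative reference can be mistaken for a scheme separator. The sketch below reproduces that ambiguity with the JDK's java.net.URI. It is not the crawler's own resolution code, and the class name ColonInRelativeUrlSketch is hypothetical.

```java
import java.net.URI;

/** Hypothetical illustration only; not the crawler's URL resolution code. */
final class ColonInRelativeUrlSketch {

    /** Resolves a reference against a base URL using java.net.URI. */
    static String resolve(String base, String reference) {
        return URI.create(base).resolve(URI.create(reference)).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/a/b.html";

        // The colon comes before any '/', '?' or '#', so "this" is parsed
        // as a scheme and the reference is returned as-is.
        System.out.println(resolve(base, "this:that"));

        // Prefixing "./" (RFC 3986, section 4.2) marks the reference as a
        // relative path, so it is resolved against the base.
        System.out.println(resolve(base, "./this:that"));

        // A colon that appears after '?' is never mistaken for a scheme.
        System.out.println(resolve(base, "page.html?time=10:30"));
    }
}
```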

3.0.0 Major release. NOT a drop-in replacement for 2.x. Release date 2022-01-05

Updated	Updated transitive dependencies with known vulnerabilities.
Fixed	Fixed data store engine errors (via "collector-core" dependency update).	#766
Fixed	Fixed "Connection Pool Shut Down" error (from collector-core update).	#770
New	Can now automatically stop a crawler when it reaches a certain number of crawler events, thanks to the new Collector Core "StopCrawlerOnMaxEventListener" class.	#172
New	Can now send deletion requests to Committers based on "rejected" events thanks to the new Collector Core "DeleteRejectedEventListener" class.	#211
New	Added optional deduplication feature based on metadata and/or document checksums.	#579
New	Added support for ETag/If-None-Match to GenericHttpFetcher.	#182
New	Added support for HTTP Strict Transport Security (HSTS) to GenericHttpFetcher. Active by default.	#694
New	Added support for restricting which HTTP method GenericHttpFetcher accepts.	#654
New	Added "DISABLED", "OPTIONAL", "REQUIRED" options for HTTP HEAD and GET methods.	#654
New	New DOMLinkExtractor repeatable "extractSelector" and "noExtractSelector".
Updated	HttpCrawlerConfig metadata checksummer moved to base class CrawlerConfig.
Updated	DOMLinkExtractor configurable "dom" elements deprecated in favor of "linkSelector".
Updated	Checksummers "disabled" flag deprecated in favor of setting a null checksummer or using a self-closed checksummer tag in config.
Fixed	Fixed invalid configuration in POM "maven-dependency-plugin".
Removed	Removed JEF dependency in favor of improved JMX for tracking.
Updated	WebDriver HttpSniffer can now have its max buffer size configured.	#751
Fixed	Fixed NullPointerException when resolving sitemaps.	#738
Fixed	Fixed sitemap.xml URL entries not being extracted when they contain custom elements.	#758
New	New IHttpFetcher for making HTTP requests. Multiple instances can now be specified and tried in sequence. This replaces IHttpClientFactory, IHttpDocumentFetcher, and IHttpMetadataFetcher.
New	New WebDriverHttpFetcher for using popular browsers in headless mode (Chrome, Firefox, ...). Ideal for JavaScript-driven websites and taking screenshots.
New	Now supports If-Modified-Since for more efficient crawling.	#637
New	New flag for loading start URLs asynchronously.
New	New HttpCollectorEvent.
New	New crawler event: URLS_POST_IMPORTED (in addition to new events from Collector Core).
New	New GenericHttpFetcher, replacing GenericHttpClientFactory and GenericHttpDocumentFetcher.
New	New "disableSNI" crawler configuration option to disable Server Name Indication.	#577
New	New DOMLinkExtractor using JSoup to extract links.
New	Link extractors can now extract links from metadata fields in addition to content.
New	New "postImportLinks" configuration option for considering links from metadata fields created during import for crawling (providing a way to extract links from parsed binaries).	#428
New	New DOMLinkExtractor that uses JSoup to extract links from HTML/XML documents.	#668
New	New HttpDoc, HttpDocInfo, HttpDocMetadata (new or renamed).
New	New metadata field set when URL changes from normalization: "collector.originalReference".
New	Form authentication can now parse and submit HTML forms (taking login page URL instead of form action URL).
Updated	PhantomJSDocumentFetcher now deprecated in favor of WebDriverHttpFetcher.
Updated	Now using XML class from Norconex Commons Lang for loading/saving configuration.
Updated	User-Agent no longer set directly on crawler config. It can be set on IHttpFetcher implementations that support it.
Updated	Now using SLF4J for logging.
Updated	HtmlLinkExtractor (and DOMLinkExtractor) now extracts all attributes of referrer links and adds them to the target document metadata. This can be disabled with the #setIgnoreLinkData() method.	#668
Updated	Dependency updates: Norconex Collector Core 2.0.0, Jetty 9.4.12.v20180830.
Updated	Now requires Java 8 or higher.
Updated	Lists now replace arrays in most places.
Updated	Path is now used in addition to, or instead of, File in many places.
Updated	Default working directory structure has been improved.
Updated	Improved handling of "trustAllSSLCertificate" in GenericHttpFetcher by auto-accepting/storing certificates.	#592
Updated	Date handling now considers the time zone.
Updated	Can now control whether to store as metadata all extracted links, just in-scope ones, out-of-scope, both, or none.
Updated	Link extractors and canonical URL classes are now in their own distinct packages.
Updated	GenericLinkExtractor renamed to HtmlLinkExtractor.
Updated	Sitemap information now stored in the new data store engine as opposed to its own storage mechanism.
Updated	URLStatusCrawlerEventListener now saves as CSV instead of TSV.
Updated	HttpCrawlerEvent now only holds HTTP crawler event names. Crawler events are now all of CrawlerEvent type.
Removed	Removed some code deprecated in releases before 3.0.0.
Removed	IHttpClientFactory, IHttpDocumentFetcher, and IHttpMetadataFetcher and implementations were removed in favor of IHttpFetcher and GenericHttpFetcher.
Removed	Removed "data" package (and its classes) in favor of new classes in "doc" package.
Removed	JDBC crawl store implementation has been removed in favor of NoSQL only (MVStore + MongoDB).
Removed	ISitemapResolverFactory removed in favor of ISitemapResolver.
Removed	Removed HttpCollectorEvent (now relying on CollectorEvent only).
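
The ETag/If-None-Match (#182) and If-Modified-Since (#637) entries above both rely on HTTP conditional requests: the client sends back a validator from the previous crawl, and the server can answer 304 Not Modified instead of resending the document. The sketch below builds such a request with the JDK 11 java.net.http API purely for illustration. It is not how GenericHttpFetcher is implemented, and the class ConditionalRequestSketch and its methods are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical illustration only; not the GenericHttpFetcher implementation. */
final class ConditionalRequestSketch {

    /** HTTP-date format with a two-digit day, always expressed in GMT. */
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US);

    /**
     * Builds a GET request carrying whichever validators are known from
     * the previous crawl. Either argument may be null.
     */
    static HttpRequest build(
            String url, String etag, ZonedDateTime lastCrawled) {
        HttpRequest.Builder builder =
                HttpRequest.newBuilder(URI.create(url)).GET();
        if (etag != null) {
            // The ETag is sent back exactly as received, quotes included.
            builder.header("If-None-Match", etag);
        }
        if (lastCrawled != null) {
            // Convert to UTC first so the literal "GMT" suffix is accurate.
            builder.header("If-Modified-Since", HTTP_DATE.format(
                    lastCrawled.withZoneSameInstant(ZoneOffset.UTC)));
        }
        return builder.build();
    }

    /** A 304 response means the previously crawled copy is still current. */
    static boolean isNotModified(int statusCode) {
        return statusCode == 304;
    }

    public static void main(String[] args) {
        HttpRequest request = build(
                "http://example.com/page.html",
                "\"abc123\"",
                ZonedDateTime.of(2022, 1, 5, 0, 0, 0, 0, ZoneOffset.UTC));
        System.out.println(request.headers().map());
    }
}
```

When the server replies 304, a crawler can skip downloading and re-processing the document.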