Updated |
Updated transitive dependencies with known vulnerabilities.
|
|
Fixed |
Fixed data store engine errors (via "collector-core" dependency update).
|
#766 |
Fixed |
Fixed "Connection Pool Shut Down" error (from collector-core update).
|
#770 |
New |
Can now automatically stop a crawler when it reaches a certain number
of crawler events, thanks to the new Collector Core
"StopCrawlerOnMaxEventListener" class.
|
#172 |
New |
Can now send deletion requests to Committers based on "rejected" events
thanks to the new Collector Core "DeleteRejectedEventListener" class.
|
#211 |
New |
Added optional deduplication feature based on metadata and/or document
checksums.
|
#579 |
New |
Added support for ETag/If-None-Match to GenericHttpFetcher.
|
#182 |
New |
Added support for HTTP Strict Transport Security (HSTS) to
GenericHttpFetcher. Active by default.
|
#694 |
New |
Added support for restricting which HTTP method GenericHttpFetcher
accepts.
|
#654 |
New |
Added "DISABLED", "OPTIONAL", and "REQUIRED" options for HTTP
HEAD and GET methods.
|
#654 |
New |
New DOMLinkExtractor repeatable options: "extractSelector" and
"noExtractSelector".
|
|
Updated |
HttpCrawlerConfig metadata checksummer moved to base class CrawlerConfig.
|
|
Updated |
DOMLinkExtractor configurable "dom" elements deprecated in favor
of "linkSelector".
|
|
Updated |
Checksummers "disabled" flag deprecated in favor of setting a null
checksummer or using a self-closed checksummer tag in config.
|
|
Fixed |
Fixed invalid configuration in POM "maven-dependency-plugin".
|
|
Removed |
Removed JEF dependency in favor of improved JMX for tracking.
|
|
Updated |
WebDriver HttpSniffer can now have its max buffer size configured.
|
#751 |
Fixed |
Fixed NullPointerException when resolving sitemaps.
|
#738 |
Fixed |
Fixed sitemap.xml URL entries not being extracted when they contain
custom elements.
|
#758 |
New |
New IHttpFetcher for making HTTP requests. Multiple instances
can now be specified and tried in sequence. This replaces
IHttpClientFactory, IHttpDocumentFetcher, and IHttpMetadataFetcher.
|
|
New |
New WebDriverHttpFetcher for using popular browsers in headless mode
(Chrome, Firefox, ...). Ideal for JavaScript-driven websites and
taking screenshots.
|
|
New |
Now supports If-Modified-Since for more efficient crawling.
|
#637 |
New |
New flag for loading start URLs asynchronously.
|
|
New |
New HttpCollectorEvent.
|
|
New |
New crawler event: URLS_POST_IMPORTED
(in addition to new events from Collector Core).
|
|
New |
New GenericHttpFetcher, replacing GenericHttpClientFactory and
GenericHttpDocumentFetcher.
|
|
New |
New "disableSNI" crawler configuration option to disable Server Name
Indication.
|
#577 |
New |
New DOMLinkExtractor using JSoup to extract links.
|
|
New |
Link extractors can now extract links from metadata fields in
addition to content.
|
|
New |
New "postImportLinks" configuration option for considering links from
metadata fields created during import for crawling
(providing a way to extract links from parsed binaries).
|
#428 |
New |
New DOMLinkExtractor that uses JSoup to extract links from
HTML/XML documents.
|
#668 |
New |
New HttpDoc, HttpDocInfo, HttpDocMetadata (new or renamed).
|
|
New |
New metadata field set when URL changes from normalization:
"collector.originalReference".
|
|
New |
Form authentication can now parse and submit HTML forms
(taking login page URL instead of form action URL).
|
|
Updated |
PhantomJSDocumentFetcher now deprecated in favor of
WebDriverHttpFetcher.
|
|
Updated |
Now using XML class from Norconex Commons Lang for loading/saving
configuration.
|
|
Updated |
User-Agent no longer set directly on crawler config. It can be set
on IHttpFetcher implementations that support it.
|
|
Updated |
Now using SLF4J for logging.
|
|
Updated |
HtmlLinkExtractor (and DOMLinkExtractor) now extracts all attributes
of referrer links and adds them to the target document metadata.
This can be disabled with the #setIgnoreLinkData() method.
|
#668 |
Updated |
Dependency updates: Norconex Collector Core 2.0.0,
Jetty 9.4.12.v20180830.
|
|
Updated |
Now requires Java 8 or higher.
|
|
Updated |
Lists now replace arrays in most places.
|
|
Updated |
Path is now used in addition to, or instead of, File in many places.
|
|
Updated |
Default working directory structure has been improved.
|
|
Updated |
Improved handling of "trustAllSSLCertificate" in GenericHttpFetcher
by auto-accepting/storing certificates.
|
#592 |
Updated |
Date handling now considers the time zone.
|
|
Updated |
Can now control which extracted links are stored as metadata:
in-scope only, out-of-scope only, both, or none.
|
|
Updated |
Link extractors and canonical URL classes are now in their own
distinct packages.
|
|
Updated |
GenericLinkExtractor renamed to HtmlLinkExtractor.
|
|
Updated |
Sitemap information now stored in the new data store engine as opposed
to its own storage mechanism.
|
|
Updated |
URLStatusCrawlerEventListener now saves as CSV instead of TSV.
|
|
Updated |
HttpCrawlerEvent now only holds HTTP crawler event names.
Crawler events are now all of CrawlerEvent type.
|
|
Removed |
Removed some code deprecated in releases before 3.0.0.
|
|
Removed |
IHttpClientFactory, IHttpDocumentFetcher, and IHttpMetadataFetcher and
implementations were removed in favor of IHttpFetcher and
GenericHttpFetcher.
|
|
Removed |
Removed "data" package (and its classes) in favor of new classes
in "doc" package.
|
|
Removed |
JDBC crawl store implementation has been removed in favor of
NoSQL only (MVStore + MongoDB).
|
|
Removed |
ISitemapResolverFactory removed in favor of ISitemapResolver.
|
|
Removed |
Removed HttpCollectorEvent (now relying on CollectorEvent only).
|
|