All Classes and Interfaces

Convenience class to encapsulate various delay strategies.
Base implementation for creating voluntary delays between URL downloads.
Base class implementing the AbstractHttpFetcher.accept(Doc, HttpMethod) method, using reference filters to determine whether this fetcher will accept a URL, and delegating the HTTP method check to its own AbstractHttpFetcher.accept(HttpMethod) abstract method.
Base class for link extraction providing common configuration settings.
Base class for link extraction from text documents, providing common configuration settings such as applying extraction to specific documents only, and specifying one or more metadata fields from which to grab the text for extracting links.
Utility methods for fetcher implementations using Apache HttpClient.
This class is used by each crawler instance to capture the closest redirect target whether it is part of a redirect chain or not.
A web browser.
It is assumed there will be one instance of this class per crawler defined.
Handles images associated with a document (which is different than a document being itself an image).
Extracts links from a Document Object Model (DOM) representation of an HTML, XHTML, or XML document content based on values of matching elements and attributes.
Document processor that extracts the "main" image from HTML pages.
Generic canonical link detector.
Default implementation for creating voluntary delays between URL downloads.
Default implementation of IHttpFetcher, based on Apache HttpClient.
Generic HTTP Fetcher configuration.
Deprecated.
Since 3.0.0, use HtmlLinkExtractor or DOMLinkExtractor instead.
Relies on both sitemap directives and custom instructions for establishing the minimum frequency between each document recrawl.
Provide redirect URLs by grabbing them from the HTTP Response Location header value.
Implementation of ISitemapResolver as per the sitemap.xml standard defined at http://www.sitemaps.org/protocol.html.
Generic implementation of IURLNormalizer that should satisfy most URL normalization needs.
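The kinds of normalizations such a component typically performs (lowercasing the scheme and host, dropping default ports, removing fragments) can be sketched generically. The class below is an illustrative, self-contained example using `java.net.URI`; it is not the library's actual implementation or API.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative sketch of common URL normalizations (NOT the library's API):
// lowercases scheme and host, drops default ports, removes fragments.
public class UrlNormalizerSketch {
    public static String normalize(String url) {
        try {
            URI u = new URI(url);
            String scheme = u.getScheme() == null ? "http" : u.getScheme().toLowerCase();
            String host = u.getHost() == null ? "" : u.getHost().toLowerCase();
            int port = u.getPort();
            // Drop the port when it is the default for the scheme.
            if ((port == 80 && scheme.equals("http"))
                    || (port == 443 && scheme.equals("https"))) {
                port = -1;
            }
            String path = (u.getPath() == null || u.getPath().isEmpty())
                    ? "/" : u.getPath();
            // Rebuild without the fragment.
            return new URI(scheme, u.getUserInfo(), host, port,
                    path, u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            return url; // leave unparseable URLs untouched
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://Example.COM:80/a#frag"));
    }
}
```

Real normalizers usually offer many more rules (sorting query parameters, removing session IDs, etc.) that are enabled individually through configuration.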
Class handling HSTS support for servers supporting it.
HTML link extractor for URLs found in HTML and possibly other text files.
Generic HTTP Fetcher authentication configuration.
Main application class.
HTTP Collector configuration.
The HTTP Crawler.
HTTP Crawler configuration.
HTTP Crawler event names.
Represents a URL crawling status.
A URL being crawled holding relevant crawl information.
Metadata constants for common metadata field names typically set by the HTTP Collector crawler.
Fetches HTTP resources, trying all configured HTTP fetchers, defaulting to GenericHttpFetcher with default configuration if none are defined.
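The "try each configured fetcher in turn" idea can be sketched as follows. The `Fetcher` interface and method names here are hypothetical, invented purely for illustration; the library's actual types differ.

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch of a fetcher chain: the first fetcher accepting
// the URL handles it. (Hypothetical types, NOT the library's API.)
public class FetcherChainSketch {
    interface Fetcher {
        boolean accepts(String url);
        String fetch(String url);
    }

    // Returns the response of the first fetcher that accepts the URL,
    // or empty when no fetcher accepts it.
    public static Optional<String> fetchFirst(List<Fetcher> fetchers, String url) {
        for (Fetcher f : fetchers) {
            if (f.accepts(url)) {
                return Optional.of(f.fetch(url));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Fetcher pdfOnly = new Fetcher() {
            public boolean accepts(String url) { return url.endsWith(".pdf"); }
            public String fetch(String url) { return "pdf:" + url; }
        };
        Fetcher fallback = new Fetcher() {
            public boolean accepts(String url) { return true; }
            public String fetch(String url) { return "generic:" + url; }
        };
        System.out.println(fetchFirst(List.of(pdfOnly, fallback), "http://x/y.html"));
    }
}
```

Ordering matters in such a chain: a catch-all fetcher placed first would shadow more specific ones.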
Hold HTTP response information obtained from fetching a document using HttpFetchClient.
Checked exception thrown upon encountering an error performing an HTTP fetch.
Builder facilitating creation of an HTTP fetch response.
All execution steps of a document processing from the moment it is obtained from queue up to importing it.
Performs URL handling logic before the actual processing of the document it represents takes place.
Configuration for HttpSniffer.
Detects and returns any canonical URL found in documents, whether from the HTTP headers (metadata) or from the page content (usually HTML).
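Detecting a canonical URL in page content usually means finding a `<link rel="canonical" href="...">` tag. Below is a simplified regex-based sketch of that idea (real detectors also check the HTTP `Link` response header and handle more markup variations; this is not the library's implementation).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of canonical URL detection in HTML content.
// Simplified: only matches link tags where rel appears before href.
public class CanonicalSketch {
    private static final Pattern LINK_TAG = Pattern.compile(
            "<link[^>]*rel\\s*=\\s*[\"']canonical[\"'][^>]*href\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the canonical URL, or null when none is found.
    public static String findCanonical(String html) {
        Matcher m = LINK_TAG.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(findCanonical(
            "<head><link rel=\"canonical\" href=\"https://example.com/page\"/></head>"));
    }
}
```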
Resolves and creates intentional "delays" to increase document download time intervals.
Custom processing (optional) performed on a document.
Fetches HTTP resources.
Responsible for finding links in documents.
Caches images.
Indicates whether a document that was successfully crawled on a previous crawling session should be recrawled or not.
Responsible for providing a target absolute URL each time an HTTP redirect is encountered when invoking a URL.
Responsible for extracting robot information from a page.
Holds a robots.txt rule.
Given a URL, extract any "robots.txt" rules.
Given a URL root, resolve the corresponding sitemap(s), if any, and only if it has not yet been resolved for a crawling session.
Provide starting URLs for crawling.
Responsible for normalizing URLs.
Default implementation of IMetadataChecksummer for the Norconex HTTP Collector which simply returns the exact value of the "Last-Modified" HTTP header field, or null if not present.
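The idea behind such a checksummer is simple: the "Last-Modified" header value itself acts as the checksum, so an unchanged value means the document likely did not change. A minimal stand-alone sketch (hypothetical helper, not the library's implementation):

```java
import java.util.Map;

// Illustrative sketch of a metadata checksum based on the
// "Last-Modified" HTTP header. (NOT the library's implementation.)
public class LastModifiedChecksumSketch {
    // Returns the header value as-is, or null when absent; comparing
    // this value across crawl sessions detects (likely) modifications.
    public static String checksum(Map<String, String> headers) {
        return headers.get("Last-Modified");
    }

    public static void main(String[] args) {
        System.out.println(checksum(
            Map.of("Last-Modified", "Tue, 15 Nov 1994 12:45:26 GMT")));
    }
}
```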
Represents a link extracted from a document.
Deprecated.
Since 3.0.0, use WebDriverHttpFetcher instead.
Introduces different delays between document downloads based on matching document reference (URL) patterns.
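Pattern-based delays can be modeled as an ordered list of regex-to-duration mappings with a default fallback. The structure and method names below are hypothetical, shown only to illustrate the concept; they are not the library's actual API.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative sketch of pattern-based download delays:
// the first matching URL pattern wins, else a default delay applies.
// (Hypothetical structure, NOT the library's API.)
public class ReferenceDelaySketch {
    private final Map<Pattern, Duration> delays = new LinkedHashMap<>();
    private final Duration defaultDelay;

    public ReferenceDelaySketch(Duration defaultDelay) {
        this.defaultDelay = defaultDelay;
    }

    public void addDelay(String regex, Duration delay) {
        delays.put(Pattern.compile(regex), delay);
    }

    public Duration delayFor(String url) {
        for (Map.Entry<Pattern, Duration> e : delays.entrySet()) {
            if (e.getKey().matcher(url).matches()) {
                return e.getValue();
            }
        }
        return defaultDelay;
    }

    public static void main(String[] args) {
        ReferenceDelaySketch d = new ReferenceDelaySketch(Duration.ofSeconds(3));
        d.addDelay(".*\\.example\\.com/.*", Duration.ofSeconds(10));
        System.out.println(d.delayFor("https://www.example.com/page"));
    }
}
```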
Link extractor using regular expressions to extract links found in text documents.
Takes screenshots of pages using a Selenium WebDriver.
Filters URLs based on the number of URL segments.
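Counting URL path segments, as such a filter does, can be sketched like this (a stand-alone illustration, not the library's implementation):

```java
// Illustrative sketch of filtering URLs by path segment count
// (NOT the library's implementation).
public class SegmentCountSketch {
    // Counts non-empty path segments, e.g. "https://host/a/b/c" -> 3.
    public static int countSegments(String url) {
        int schemeEnd = url.indexOf("://");
        String rest = schemeEnd >= 0 ? url.substring(schemeEnd + 3) : url;
        int slash = rest.indexOf('/');
        if (slash < 0) {
            return 0; // no path at all
        }
        // Strip query string and fragment before counting.
        String path = rest.substring(slash + 1).split("[?#]", 2)[0];
        int count = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Example policy: accept URLs with at most maxSegments segments.
    public static boolean accept(String url, int maxSegments) {
        return countSegments(url) <= maxSegments;
    }

    public static void main(String[] args) {
        System.out.println(countSegments("https://example.com/a/b/c"));
    }
}
```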
Sitemap change frequency unit, as defined at http://www.sitemaps.org/protocol.html.
Implementation of IRobotsMetaProvider as per the X-Robots-Tag and ROBOTS standards.
Implementation of IRobotsTxtProvider as per the robots.txt standard described at http://www.robotstxt.org/robotstxt.html.
Implementation of ILinkExtractor using Apache Tika to perform URL extractions from HTML documents.
A very unsafe trust manager accepting ALL certificates.
By default, a crawler will try to follow all links it discovers.
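A common way to restrict that behavior is to keep the crawl within the start URL's host. The helper below is a hypothetical sketch of such a same-host check, not the library's actual strategy class.

```java
import java.net.URI;

// Illustrative sketch of restricting crawl scope to the start URL's host.
// (Hypothetical helper, NOT the library's API.)
public class CrawlScopeSketch {
    public static boolean sameHost(String startUrl, String candidateUrl) {
        try {
            String h1 = new URI(startUrl).getHost();
            String h2 = new URI(candidateUrl).getHost();
            return h1 != null && h1.equalsIgnoreCase(h2);
        } catch (Exception e) {
            return false; // unparseable URLs fall outside the scope
        }
    }

    public static void main(String[] args) {
        System.out.println(sameHost("https://example.com/", "https://example.com/about"));
    }
}
```

Stricter variants also compare the scheme and port, or restrict to a sub-path of the start URL.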
Store on file all URLs that were "fetched", along with their HTTP response code.
Uses Selenium WebDriver support for using native browsers to crawl documents.
Configuration for WebDriverHttpFetcher.
Link extractor for extracting links out of RSS and Atom XML feeds.