All Classes and Interfaces

Convenience class to encapsulate various delay strategies.
Base implementation for creating voluntary delays between URL downloads.
Base class implementing the AbstractHttpFetcher.accept(Doc, HttpMethod) method, using reference filters to determine whether this fetcher will accept a URL, and delegating the HTTP method check to its own AbstractHttpFetcher.accept(HttpMethod) abstract method.
Base class for link extraction providing common configuration settings.
Base class for link extraction from text documents, providing common configuration settings such as applying extraction to specific documents only, and specifying one or more metadata fields from which to grab the text for extracting links.
Utility methods for fetcher implementations using Apache HttpClient.
This class is used by each crawler instance to capture the closest redirect target whether it is part of a redirect chain or not.
A web browser.
It is assumed there will be one instance of this class per crawler defined.
Handles images associated with a document (which is different than a document being itself an image).
Extracts links from a Document Object Model (DOM) representation of an HTML, XHTML, or XML document content based on values of matching elements and attributes.
Document processor that extracts the "main" image from HTML pages.
Generic canonical link detector.
Default implementation for creating voluntary delays between URL downloads.
Default implementation of IHttpFetcher, based on Apache HttpClient.
Generic HTTP Fetcher configuration.
Deprecated.
Since 3.0.0, use HtmlLinkExtractor or DOMLinkExtractor instead.
Relies on both sitemap directives and custom instructions for establishing the minimum frequency between each document recrawl.
Provide redirect URLs by grabbing them from the HTTP Response Location header value.
Implementation of ISitemapResolver as per the sitemap.xml standard defined at http://www.sitemaps.org/protocol.html.
Generic implementation of IURLNormalizer that should satisfy most URL normalization needs.
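The kinds of normalizations such a component typically performs (lowercasing the scheme and host, dropping default ports, removing fragments) can be sketched generically. The class below is an illustrative, self-contained example using `java.net.URI`; it is not the library's actual implementation or API.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative sketch of common URL normalizations (NOT the library's API):
// lowercases scheme and host, drops default ports, removes fragments.
public class UrlNormalizerSketch {
    public static String normalize(String url) {
        try {
            URI u = new URI(url);
            String scheme = u.getScheme() == null ? "http" : u.getScheme().toLowerCase();
            String host = u.getHost() == null ? "" : u.getHost().toLowerCase();
            int port = u.getPort();
            // Drop the port when it is the default for the scheme.
            if ((port == 80 && scheme.equals("http"))
                    || (port == 443 && scheme.equals("https"))) {
                port = -1;
            }
            String path = (u.getPath() == null || u.getPath().isEmpty())
                    ? "/" : u.getPath();
            // Rebuild without the fragment.
            return new URI(scheme, u.getUserInfo(), host, port,
                    path, u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            return url; // leave unparseable URLs untouched
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://Example.COM:80/a#frag"));
    }
}
```

Real normalizers usually offer many more rules (sorting query parameters, removing session IDs, etc.) that are enabled individually through configuration.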
Class handling HSTS support for servers supporting it.
HTML link extractor for URLs found in HTML and possibly other text files.
Generic HTTP Fetcher authentication configuration.
Main application class.
HTTP Collector configuration.
The HTTP Crawler.
HTTP Crawler configuration.
HTTP Crawler event names.
Represents a URL crawling status.
A URL being crawled holding relevant crawl information.
Metadata constants for common metadata field names typically set by the HTTP Collector crawler.
Fetches HTTP resources, trying all configured HTTP fetchers, defaulting to GenericHttpFetcher with default configuration if none are defined.
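The "try each configured fetcher in turn" idea can be sketched as follows. The `Fetcher` interface and method names here are hypothetical, invented purely for illustration; the library's actual types differ.

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch of a fetcher chain: the first fetcher accepting
// the URL handles it. (Hypothetical types, NOT the library's API.)
public class FetcherChainSketch {
    interface Fetcher {
        boolean accepts(String url);
        String fetch(String url);
    }

    // Returns the response of the first fetcher that accepts the URL,
    // or empty when no fetcher accepts it.
    public static Optional<String> fetchFirst(List<Fetcher> fetchers, String url) {
        for (Fetcher f : fetchers) {
            if (f.accepts(url)) {
                return Optional.of(f.fetch(url));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Fetcher pdfOnly = new Fetcher() {
            public boolean accepts(String url) { return url.endsWith(".pdf"); }
            public String fetch(String url) { return "pdf:" + url; }
        };
        Fetcher fallback = new Fetcher() {
            public boolean accepts(String url) { return true; }
            public String fetch(String url) { return "generic:" + url; }
        };
        System.out.println(fetchFirst(List.of(pdfOnly, fallback), "http://x/y.html"));
    }
}
```

Ordering matters in such a chain: a catch-all fetcher placed first would shadow more specific ones.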
Hold HTTP response information obtained from fetching a document using HttpFetchClient.
Checked exception thrown upon encountering an error performing an HTTP fetch.
Builder facilitating creation of an HTTP fetch response.
All execution steps of a document processing from the moment it is obtained from queue up to importing it.
Performs URL handling logic before the actual processing of the document it represents takes place.
Configuration for HttpSniffer.
Detects and returns any canonical URL found in documents, whether from the HTTP headers (metadata) or from the page content (usually HTML).
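Detecting a canonical URL in page content usually means finding a `<link rel="canonical" href="...">` tag. Below is a simplified regex-based sketch of that idea (real detectors also check the HTTP `Link` response header and handle more markup variations; this is not the library's implementation).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of canonical URL detection in HTML content.
// Simplified: only matches link tags where rel appears before href.
public class CanonicalSketch {
    private static final Pattern LINK_TAG = Pattern.compile(
            "<link[^>]*rel\\s*=\\s*[\"']canonical[\"'][^>]*href\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the canonical URL, or null when none is found.
    public static String findCanonical(String html) {
        Matcher m = LINK_TAG.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(findCanonical(
            "<head><link rel=\"canonical\" href=\"https://example.com/page\"/></head>"));
    }
}
```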
Resolves and creates intentional "delays" to increase document download time intervals.
Custom processing (optional) performed on a document.
Fetches HTTP resources.
Responsible for finding links in documents.
Caches images.
Indicates whether a document that was successfully crawled on a previous crawling session should be recrawled or not.
Responsible for providing a target absolute URL each time an HTTP redirect is encountered when invoking a URL.
Responsible for extracting robot information from a page.
Holds a robots.txt rule.
Given a URL, extract any "robots.txt" rules.
Given a URL root, resolve the corresponding sitemap(s), if any, and only if it has not yet been resolved for a crawling session.
Provide starting URLs for crawling.
Responsible for normalizing URLs.
Default implementation of IMetadataChecksummer for the Norconex HTTP Collector which simply returns the exact value of the "Last-Modified" HTTP header field, or null if not present.
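The idea behind such a checksummer is simple: the "Last-Modified" header value itself acts as the checksum, so an unchanged value means the document likely did not change. A minimal stand-alone sketch (hypothetical helper, not the library's implementation):

```java
import java.util.Map;

// Illustrative sketch of a metadata checksum based on the
// "Last-Modified" HTTP header. (NOT the library's implementation.)
public class LastModifiedChecksumSketch {
    // Returns the header value as-is, or null when absent; comparing
    // this value across crawl sessions detects (likely) modifications.
    public static String checksum(Map<String, String> headers) {
        return headers.get("Last-Modified");
    }

    public static void main(String[] args) {
        System.out.println(checksum(
            Map.of("Last-Modified", "Tue, 15 Nov 1994 12:45:26 GMT")));
    }
}
```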
Represents a link extracted from a document.
Deprecated.
Since 3.0.0, use WebDriverHttpFetcher instead.
Introduces different delays between document downloads based on matching document reference (URL) patterns.
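Pattern-based delays can be modeled as an ordered list of regex-to-duration mappings with a default fallback. The structure and method names below are hypothetical, shown only to illustrate the concept; they are not the library's actual API.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative sketch of pattern-based download delays:
// the first matching URL pattern wins, else a default delay applies.
// (Hypothetical structure, NOT the library's API.)
public class ReferenceDelaySketch {
    private final Map<Pattern, Duration> delays = new LinkedHashMap<>();
    private final Duration defaultDelay;

    public ReferenceDelaySketch(Duration defaultDelay) {
        this.defaultDelay = defaultDelay;
    }

    public void addDelay(String regex, Duration delay) {
        delays.put(Pattern.compile(regex), delay);
    }

    public Duration delayFor(String url) {
        for (Map.Entry<Pattern, Duration> e : delays.entrySet()) {
            if (e.getKey().matcher(url).matches()) {
                return e.getValue();
            }
        }
        return defaultDelay;
    }

    public static void main(String[] args) {
        ReferenceDelaySketch d = new ReferenceDelaySketch(Duration.ofSeconds(3));
        d.addDelay(".*\\.example\\.com/.*", Duration.ofSeconds(10));
        System.out.println(d.delayFor("https://www.example.com/page"));
    }
}
```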
Link extractor using regular expressions to extract links found in text documents.
Takes screenshots of pages using a Selenium WebDriver.
Filters URLs based on the number of URL segments.
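Counting URL path segments, as such a filter does, can be sketched like this (a stand-alone illustration, not the library's implementation):

```java
// Illustrative sketch of filtering URLs by path segment count
// (NOT the library's implementation).
public class SegmentCountSketch {
    // Counts non-empty path segments, e.g. "https://host/a/b/c" -> 3.
    public static int countSegments(String url) {
        int schemeEnd = url.indexOf("://");
        String rest = schemeEnd >= 0 ? url.substring(schemeEnd + 3) : url;
        int slash = rest.indexOf('/');
        if (slash < 0) {
            return 0; // no path at all
        }
        // Strip query string and fragment before counting.
        String path = rest.substring(slash + 1).split("[?#]", 2)[0];
        int count = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Example policy: accept URLs with at most maxSegments segments.
    public static boolean accept(String url, int maxSegments) {
        return countSegments(url) <= maxSegments;
    }

    public static void main(String[] args) {
        System.out.println(countSegments("https://example.com/a/b/c"));
    }
}
```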
Sitemap change frequency unit, as defined at http://www.sitemaps.org/protocol.html.
Implementation of IRobotsMetaProvider as per the X-Robots-Tag and ROBOTS standards.
Implementation of IRobotsTxtProvider as per the robots.txt standard described at http://www.robotstxt.org/robotstxt.html.
Implementation of ILinkExtractor using Apache Tika to perform URL extractions from HTML documents.
A very unsafe trust manager accepting ALL certificates.
By default, a crawler will try to follow all links it discovers.
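A common way to restrict that behavior is to keep the crawl within the start URL's host. The helper below is a hypothetical sketch of such a same-host check, not the library's actual strategy class.

```java
import java.net.URI;

// Illustrative sketch of restricting crawl scope to the start URL's host.
// (Hypothetical helper, NOT the library's API.)
public class CrawlScopeSketch {
    public static boolean sameHost(String startUrl, String candidateUrl) {
        try {
            String h1 = new URI(startUrl).getHost();
            String h2 = new URI(candidateUrl).getHost();
            return h1 != null && h1.equalsIgnoreCase(h2);
        } catch (Exception e) {
            return false; // unparseable URLs fall outside the scope
        }
    }

    public static void main(String[] args) {
        System.out.println(sameHost("https://example.com/", "https://example.com/about"));
    }
}
```

Stricter variants also compare the scheme and port, or restrict to a sub-path of the start URL.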
Store on file all URLs that were "fetched", along with their HTTP response code.
Uses Selenium WebDriver support for using native browsers to crawl documents.
Configuration for WebDriverHttpFetcher.
Link extractor for extracting links out of RSS and Atom XML feeds.