All Classes
Class |
Description |
AbstractDelay |
Convenience class to encapsulate various delay strategies.
|
AbstractDelayResolver |
Base implementation for creating voluntary delays between URL downloads.
|
AbstractHttpFetcher |
|
AbstractLinkExtractor |
Base class for link extraction providing common configuration settings.
|
AbstractTextLinkExtractor |
Base class for link extraction from text documents, providing common
configuration settings such as being able to apply extraction to specific
documents only, and being able to specify one or more metadata fields
from which to grab the text for extracting links.
|
ApacheHttpUtil |
Utility methods for fetcher implementations using Apache HttpClient.
|
ApacheRedirectCaptureStrategy |
This class is used by each crawler instance to capture the closest
redirect target whether it is part of a redirect chain or not.
|
Browser |
A web browser.
|
Browser.CustomDriverOptions |
|
Browser.WebDriverBuilder |
|
CrawlerDelay |
It is assumed there will be one instance of this class per crawler defined.
|
DocImageHandler |
Handles images associated with a document (as opposed to a document
that is itself an image).
|
DocImageHandler.DirStructure |
|
DocImageHandler.Target |
|
DOMLinkExtractor |
Extracts links from a Document Object Model (DOM) representation of an
HTML, XHTML, or XML document content based on values of matching
elements and attributes.
|
FeaturedImageProcessor |
Document processor that extracts the "main" image from HTML pages.
|
FeaturedImageProcessor.Quality |
|
FeaturedImageProcessor.Storage |
|
FeaturedImageProcessor.StorageDiskStructure |
|
GenericCanonicalLinkDetector |
Generic canonical link detector.
|
GenericDelayResolver |
Default implementation for creating voluntary delays between URL downloads.
|
GenericDelayResolver.DelaySchedule |
|
GenericDelayResolver.DelaySchedule.DOW |
|
GenericHttpFetcher |
Default implementation of IHttpFetcher, based on Apache HttpClient.
|
GenericHttpFetcherConfig |
Generic HTTP Fetcher configuration.
|
GenericLinkExtractor |
Deprecated.
|
GenericRecrawlableResolver |
Relies on both sitemap directives and custom instructions for
establishing the minimum frequency between each document recrawl.
|
GenericRecrawlableResolver.MinFrequency |
|
GenericRecrawlableResolver.SitemapSupport |
|
GenericRedirectURLProvider |
Provides redirect URLs by grabbing them from the HTTP response
Location header value.
|
GenericSitemapResolver |
|
GenericURLNormalizer |
Generic implementation of IURLNormalizer that should satisfy
most URL normalization needs.
|
GenericURLNormalizer.Normalization |
|
GenericURLNormalizer.Replace |
|
HstsResolver |
Class handling HSTS support for servers supporting it.
|
HtmlLinkExtractor |
HTML link extractor for URLs found in HTML and possibly other text files.
|
HtmlLinkExtractor.RegexPair |
|
HttpAuthConfig |
Generic HTTP Fetcher authentication configuration.
|
HttpCollector |
Main application class.
|
HttpCollectorConfig |
HTTP Collector configuration.
|
HttpCommitterPipeline |
|
HttpCommitterPipelineContext |
|
HttpCrawler |
The HTTP Crawler.
|
HttpCrawlerConfig |
HTTP Crawler configuration.
|
HttpCrawlerConfig.HttpMethodSupport |
|
HttpCrawlerConfig.ReferencedLinkType |
|
HttpCrawlerEvent |
HTTP Crawler event names.
|
HttpCrawlState |
Represents a URL crawling status.
|
HttpDocInfo |
A URL being crawled holding relevant crawl information.
|
HttpDocMetadata |
Metadata constants for common metadata field
names typically set by the HTTP Collector crawler.
|
HttpFetchClient |
Fetches HTTP resources, trying all configured HTTP fetchers, defaulting
to GenericHttpFetcher with default configuration if none are defined.
|
HttpFetchClientResponse |
Holds HTTP response information obtained from fetching a document
using HttpFetchClient.
|
HttpFetchException |
Checked exception thrown upon encountering an error performing
an HTTP fetch.
|
HttpFetchResponseBuilder |
Builder facilitating creation of an HTTP fetch response.
|
HttpImporterPipeline |
All execution steps of document processing, from the moment a document is
obtained from the queue up to importing it.
|
HttpImporterPipelineContext |
|
HttpMethod |
|
HttpQueuePipeline |
Performs URL handling logic before the actual processing of the document
it represents takes place.
|
HttpQueuePipelineContext |
|
HttpSnifferConfig |
Configuration for HttpSniffer.
|
ICanonicalLinkDetector |
Detects and returns any canonical URL found in documents, whether from
the HTTP headers (metadata) or from the page content (usually HTML).
|
IDelayResolver |
Resolves and creates intentional "delays" to increase the time interval
between document downloads.
|
IHttpDocumentProcessor |
Custom processing (optional) performed on a document.
|
IHttpFetcher |
Fetches HTTP resources.
|
IHttpFetchResponse |
|
ILinkExtractor |
Responsible for finding links in documents.
|
ImageCache |
Caches images.
|
IRecrawlableResolver |
Indicates whether a document that was successfully crawled on a previous
crawling session should be recrawled or not.
|
IRedirectURLProvider |
Responsible for providing a target absolute URL each time an HTTP redirect
is encountered when invoking a URL.
|
IRobotsMetaProvider |
Responsible for extracting robot information from a page.
|
IRobotsTxtFilter |
Holds a robots.txt rule.
|
IRobotsTxtProvider |
Given a URL, extract any "robots.txt" rules.
|
ISitemapResolver |
Given a URL root, resolve the corresponding sitemap(s), if any, and
only if it has not yet been resolved for a crawling session.
|
IStartURLsProvider |
Provide starting URLs for crawling.
|
IURLNormalizer |
Responsible for normalizing URLs.
|
LastModifiedMetadataChecksummer |
Default implementation of IMetadataChecksummer for the
Norconex HTTP Collector which simply
returns the exact value of the "Last-Modified" HTTP header field, or
null if not present.
|
Link |
Represents a link extracted from a document.
|
PhantomJSDocumentFetcher |
Deprecated.
|
PhantomJSDocumentFetcher.Quality |
|
PhantomJSDocumentFetcher.Storage |
|
PhantomJSDocumentFetcher.StorageDiskStructure |
|
ReferenceDelayResolver |
Introduces different delays between document downloads based on matching
document reference (URL) patterns.
|
ReferenceDelayResolver.DelayReferencePattern |
|
RegexLinkExtractor |
Link extractor using regular expressions to extract links found in text
documents.
|
RobotsMeta |
|
RobotsTxt |
|
ScaledImage |
|
ScreenshotHandler |
Takes screenshots of pages using a Selenium WebDriver.
|
SegmentCountURLFilter |
Filters URLs based on the number of URL segments.
|
SiteDelay |
|
SitemapChangeFrequency |
|
StandardRobotsMetaProvider |
|
StandardRobotsTxtProvider |
|
ThreadDelay |
|
TikaLinkExtractor |
|
TrustAllX509TrustManager |
A very unsafe trust manager accepting ALL certificates.
|
URLCrawlScopeStrategy |
By default a crawler will try to follow all links it discovers.
|
URLStatusCrawlerEventListener |
Stores to file all URLs that were "fetched", along with their HTTP response
code.
|
WebDriverHttpFetcher |
Uses Selenium WebDriver to crawl documents with native browsers.
|
WebDriverHttpFetcherConfig |
|
WebDriverHttpFetcherConfig.WaitElementType |
|
XMLFeedLinkExtractor |
Link extractor for extracting links out of RSS and Atom XML feeds.
|
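
The IRedirectURLProvider and GenericRedirectURLProvider entries above describe turning an HTTP redirect into an absolute target URL taken from the Location header. The following is a minimal sketch of that idea using only the JDK; it does not use or reproduce the library's own implementation. The class name RedirectTargetSketch and the method resolveRedirect are hypothetical names chosen for illustration, and the example URLs are made up.

```java
import java.net.URI;

public class RedirectTargetSketch {

    /**
     * Hypothetical helper (not part of the library): resolves the value of
     * a Location header against the URL that was originally requested, so
     * that relative redirect targets become absolute URLs.
     *
     * @return the absolute redirect target, or null if no Location value
     */
    public static String resolveRedirect(
            String requestedUrl, String locationHeader) {
        if (locationHeader == null || locationHeader.trim().isEmpty()) {
            return null; // nothing to redirect to
        }
        URI base = URI.create(requestedUrl);
        // URI.resolve returns the argument unchanged when it is already
        // absolute, and resolves it against the base otherwise.
        return base.resolve(locationHeader.trim()).toString();
    }

    public static void main(String[] args) {
        String requested = "http://example.com/a/b.html";
        // Server-relative target
        System.out.println(resolveRedirect(requested, "/login"));
        // Path-relative target
        System.out.println(resolveRedirect(requested, "../c.html"));
        // Already-absolute target
        System.out.println(resolveRedirect(requested, "https://other.example/x"));
    }
}
```

A real crawler would likely also need to handle malformed or unescaped Location values, which URI.create rejects with an IllegalArgumentException; that handling is left out of this sketch.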