Class HttpCrawlerConfig
- java.lang.Object
  - com.norconex.collector.core.crawler.CrawlerConfig
    - com.norconex.collector.http.crawler.HttpCrawlerConfig
- All Implemented Interfaces:
IXMLConfigurable
public class HttpCrawlerConfig extends CrawlerConfig
HTTP Crawler configuration.
Start URLs
Crawling begins with one or more "start" URLs. Multiple start URLs can be defined, in a combination of ways:
- url: a start URL provided directly in the configuration (see setStartURLs(List)).
- urlsFile: a path to a local file containing a list of start URLs, one per line (see setStartURLsFiles(List)).
- sitemap: a URL pointing to a sitemap XML file that contains the URLs to crawl (see setStartSitemapURLs(List)).
- provider: your own class implementing IStartURLsProvider to dynamically provide a list of start URLs (see setStartURLsProviders(List)).
Scope: To limit crawling to specific web domains and avoid creating many filters to that effect, you can tell the crawler to "stay" within the web site "scope" with setUrlCrawlScopeStrategy(URLCrawlScopeStrategy), as sketched below.
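For illustration only, here is a minimal Java sketch defining start URLs and a crawl scope. The HttpCrawlerConfig setters are documented on this page; the package location of URLCrawlScopeStrategy and its setStayOnDomain(boolean) method are assumptions mirroring the stayOnDomain XML attribute shown further below.

import java.nio.file.Paths;

import com.norconex.collector.http.crawler.HttpCrawlerConfig;
// Assumed package for the scope strategy (not confirmed by this page):
import com.norconex.collector.http.crawler.URLCrawlScopeStrategy;

public class StartUrlsSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Start URLs given directly, from a seed file, and from a sitemap.
        cfg.setStartURLs("https://example.com/");
        cfg.setStartURLsFiles(Paths.get("/path/to/start-urls.txt"));
        cfg.setStartSitemapURLs("https://example.com/sitemap.xml");

        // Stay within the start URL domain (setStayOnDomain is assumed to
        // mirror the stayOnDomain XML attribute).
        URLCrawlScopeStrategy scope = new URLCrawlScopeStrategy();
        scope.setStayOnDomain(true);
        cfg.setUrlCrawlScopeStrategy(scope);
    }
}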
URL Normalization
Pages on web sites are often referenced using different URL patterns. Such URL variations can fool the crawler into downloading the same document multiple times. To avoid this, URLs are "normalized". That is, they are converted so they are always formulated the same way. By default, the crawler only applies normalization in ways that are semantically equivalent (see GenericURLNormalizer).
Crawl Speed
Be kind to the web sites you crawl. Being too aggressive can be perceived as a cyber-attack by the targeted web site (e.g., a DoS attack) and can lead to your crawler being blocked.
For this reason, the crawler plays nice by default: it waits a few seconds between each page download, regardless of the maximum number of threads specified or whether the pages crawled are on different web sites. This can of course be changed to be as fast as you want. See GenericDelayResolver for changing the default options. You can also provide your own "delay resolver" by supplying a class implementing IDelayResolver.
Crawl Depth
The crawl depth represents how many levels away from a start URL the crawler goes. From a browser user's perspective, it can be seen as the number of link "clicks" required from a start URL to get to a specific page. The crawler will keep going deeper for as long as it discovers new URLs that are not rejected by your configuration. This is not always desirable. For instance, a web site could have dynamically generated URLs with infinite possibilities (e.g., dynamically generated web calendars). To avoid infinite crawls, it is recommended to limit the maximum depth to something reasonable for your site with setMaxDepth(int).
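A rough sketch tuning crawl speed and depth. setDelayResolver(IDelayResolver) and setMaxDepth(int) are documented here; the GenericDelayResolver package and its setDefaultDelay(long) setter (milliseconds) are assumptions about the default implementation.

import com.norconex.collector.http.crawler.HttpCrawlerConfig;
// Assumed package and setter for the default delay resolver:
import com.norconex.collector.http.delay.impl.GenericDelayResolver;

public class SpeedAndDepthSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Wait two seconds between page downloads (assumed to be the
        // meaning of setDefaultDelay, in milliseconds).
        GenericDelayResolver delay = new GenericDelayResolver();
        delay.setDefaultDelay(2000);
        cfg.setDelayResolver(delay);

        // Do not follow links more than 5 "clicks" away from a start URL.
        cfg.setMaxDepth(5);
    }
}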
Keeping downloaded files
Downloaded files are deleted after being processed. Set setKeepDownloads(boolean) to true in order to preserve them. Files will be kept under a new "downloads" folder found under your working directory. Keep in mind this is not a method for cloning a site. Use with caution on large sites as it can quickly fill up the local disk space.
Keeping Referenced Links
By default the crawler stores, as metadata, URLs extracted from documents that are in scope. Exceptions are pages discovered at the configured maximum depth (setMaxDepth(int)). This can be changed using the setKeepReferencedLinks(Set) method, as shown in the sketch after this list. Changing this setting has no effect on which pages get crawled. Possible options are:
- INSCOPE: Default. Store "in-scope" links as HttpDocMetadata.REFERENCED_URLS.
- OUTSCOPE: Store "out-of-scope" links as HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE.
- MAXDEPTH: Also store links extracted from pages at maximum depth. Must be used with at least one other option to have any effect.
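A minimal sketch combining this option with the previous one; only methods and enum constants documented on this page are used.

import com.norconex.collector.http.crawler.HttpCrawlerConfig;
import com.norconex.collector.http.crawler.HttpCrawlerConfig.ReferencedLinkType;

public class KeepLinksSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Keep a copy of downloaded files under the "downloads" folder
        // of the working directory (off by default).
        cfg.setKeepDownloads(true);

        // Store both in-scope and out-of-scope extracted URLs as metadata.
        cfg.setKeepReferencedLinks(
                ReferencedLinkType.INSCOPE,
                ReferencedLinkType.OUTSCOPE);
    }
}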
Orphan documents
Orphans are valid documents which, on subsequent crawls, can no longer be reached (e.g., they are no longer referenced). This is regardless of whether the file has been deleted at the source or not. You can tell the crawler how to handle them with CrawlerConfig.setOrphansStrategy(OrphansStrategy), as sketched after the list below. Possible options are:
- PROCESS: Default. Tries to crawl orphans normally, as if they were still reachable by the crawler.
- IGNORE: Does nothing with orphans (not deleted, not processed).
- DELETE: Orphans are sent to your Committer for deletion.
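For instance, a minimal sketch asking the crawler to delete orphans from the configured Committer(s):

import com.norconex.collector.core.crawler.CrawlerConfig.OrphansStrategy;
import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public class OrphanStrategySketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Documents that can no longer be reached on a subsequent crawl
        // are sent to the Committer(s) as deletion requests.
        cfg.setOrphansStrategy(OrphansStrategy.DELETE);
    }
}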
Error Handling
By default the crawler logs exceptions and tries to prevent them from terminating a crawling session. There might be cases where you want the crawler to halt upon encountering some types of exceptions. You can do so with CrawlerConfig.setStopOnExceptions(List).
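As an illustration only (the element type of the list is not documented on this page; the sketch assumes it holds exception classes, matching how stop-on-exception class names are configured in XML):

import java.util.Arrays;
import java.util.List;

import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public class StopOnExceptionSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Assumption: the list holds exception classes; any such exception
        // (or a subclass of it) would halt the crawling session.
        List<Class<? extends Exception>> stoppers = Arrays.asList(
                java.net.UnknownHostException.class);
        cfg.setStopOnExceptions(stoppers);
    }
}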
Crawler Events
The crawler fires all kinds of events to notify interested parties of such things as a document being rejected, imported, committed, etc. You can listen to crawler events using CrawlerConfig.setEventListeners(List).
Data Store (Cache)
During and between crawl sessions, the crawler needs to preserve specific information in order to keep track of things such as the queue of document references to process, those already processed, whether a document has been modified since last crawled, cached document checksums, etc. For this, the crawler uses a database we call a crawl data store engine. The default implementation uses the local file system to store these (see MVStoreDataStoreEngine). While very capable and suitable for most sites, if you need a larger storage system you can provide your own implementation with CrawlerConfig.setDataStoreEngine(IDataStoreEngine).
Document Importing
The process of transforming, enhancing, and parsing documents to extract plain text, along with many other document-specific processing activities, is handled by the Norconex Importer module. See ImporterConfig for many additional configuration options.
Bad Documents
On a fresh crawl, documents that are unreachable or not obtained successfully for some reason are simply logged and ignored. On the other hand, documents that were successfully crawled once and suddenly fail on a subsequent crawl are considered "spoiled". You can decide whether to grace (retry next time), delete, or ignore those spoiled documents with CrawlerConfig.setSpoiledReferenceStrategizer(ISpoiledReferenceStrategizer).
Committing Documents
The last step of successfully processing a document is to store it in your preferred target repository (or repositories). For this to happen, you have to configure one or more Committers corresponding to your needs, or create a custom one. You can have a look at available Committers here: https://opensource.norconex.com/committers/ (see CrawlerConfig.setCommitters(List)).
HTTP Fetcher
To crawl and parse a document, it needs to be downloaded first. This is the role of one or more HTTP fetchers. GenericHttpFetcher is the default implementation and can handle most web sites. There might be cases where a more specialized way of obtaining web resources is needed. For instance, JavaScript-generated web pages are often best handled by web browsers; in such cases you can use the WebDriverHttpFetcher. You can also use setHttpFetchers(List) to supply your own fetcher implementation.
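A minimal sketch of the retry-related fetcher settings (the default fetcher is left in place; only methods documented on this page are used):

import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public class FetcherRetrySketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Retry a failed fetch attempt up to 2 more times...
        cfg.setHttpFetchersMaxRetries(2);

        // ...waiting 5000 milliseconds between attempts.
        cfg.setHttpFetchersRetryDelay(5000);
    }
}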
HTTP Methods
A fetcher typically issues an HTTP GET request to obtain a document. There might be cases where you first want to issue a separate HEAD request. One example is to filter documents based on the HTTP HEAD response information, thus possibly avoiding the download of large files you don't want.
You can tell the crawler how it should handle HTTP GET and HEAD requests using setFetchHttpGet(HttpMethodSupport) and setFetchHttpHead(HttpMethodSupport) respectively. For each, the options are:
- DISABLED: No HTTP call will be made using that method.
- OPTIONAL: If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document can still be processed successfully by the other HTTP method. Only relevant when both HEAD and GET are enabled.
- REQUIRED: If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document will be rejected and won't go any further, even if the other HTTP method was or could have been successful. Only relevant when both HEAD and GET are enabled.
If you enable only one HTTP method (the default), then specifying OPTIONAL or REQUIRED for it has the same effect. At least one method needs to be enabled for an HTTP request to be attempted. By default, HEAD requests are DISABLED and GET is REQUIRED. If you are unsure what settings to use, keep the defaults.
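A minimal sketch enabling a filtering-friendly HEAD request ahead of the usual GET, using only values documented above:

import com.norconex.collector.http.crawler.HttpCrawlerConfig;
import com.norconex.collector.http.crawler.HttpCrawlerConfig.HttpMethodSupport;

public class HttpMethodSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Try a HEAD request first so metadata filters can reject documents
        // before the (possibly large) GET download, without failing the
        // document when HEAD is unsupported or unsuccessful.
        cfg.setFetchHttpHead(HttpMethodSupport.OPTIONAL);

        // Keep the default: a successful GET is required to process a document.
        cfg.setFetchHttpGet(HttpMethodSupport.REQUIRED);
    }
}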
Filtering Unwanted Documents
Without filtering, you would typically crawl many documents you are not interested in. There are different types of filtering offered to you, occurring at different points during the URL processing life-cycle. The sooner in that life-cycle you filter out a document, the more you can improve crawler performance. It may be important for you to understand the differences:
- Reference filters: The fastest way to exclude a document. The filtering rule applies to the URL, before any HTTP request is made for that URL. Rejected documents are not queued for processing. They are not downloaded (thus no URLs are extracted). The specified "delay" between downloads is not applied (i.e., no delay for rejected documents).
- Metadata filters: Apply filtering on a document's metadata fields. If isFetchHttpHead() returns true, these filters are invoked after the crawler performs a distinct HTTP HEAD request, giving you the opportunity to filter documents based on the HTTP HEAD response and potentially save a more expensive HTTP GET download (at the cost of two HTTP requests for valid documents -- HEAD and GET). Filtering occurs before URLs are extracted. When isFetchHttpHead() is false, these filters are invoked on the metadata of the HTTP response obtained from an HTTP GET request (as the document is downloaded). Filtering occurs after URLs are extracted.
- Document filters: Use when having access to the document itself (and its content) is required to apply filtering. Always triggered after a document is downloaded and after URLs are extracted, but before it is imported (Importer module).
- Importer filters: The Importer module also offers document filtering options. At that point a document is already downloaded and its links extracted. There are two types of filtering offered by the Importer: before and after document parsing. Use filters before parsing if you need to filter on raw content or want to prevent more expensive parsing. Use filters after parsing when you need to read the content as plain text.
Robot Directives
By default, the crawler tries to respect instructions a web site has put in place for the benefit of crawlers. Here is a list of some of the popular ones that can be turned off or for which you can supply your own implementation (a short sketch follows the list):
- Robot rules: Rules defined in a "robots.txt" file at the root of a web site, or via X-Robots-Tag. See: setIgnoreRobotsTxt(boolean), setRobotsTxtProvider(IRobotsTxtProvider), setIgnoreRobotsMeta(boolean), setRobotsMetaProvider(IRobotsMetaProvider).
- HTML "nofollow": Most HTML-oriented link extractors support the rel="nofollow" attribute set on HTML links. See: HtmlLinkExtractor.setIgnoreNofollow(boolean).
- Sitemap: Sitemap XML files are auto-detected and used to find a list of URLs to crawl. To disable detection, use setIgnoreSitemap(boolean).
- Canonical URLs: The crawler will reject URLs that are non-canonical, as per HTML <meta ...> or HTTP response instructions. To crawl non-canonical pages, use setIgnoreCanonicalLinks(boolean).
- If Modified Since: The default HTTP Fetcher (GenericHttpFetcher) uses the If-Modified-Since feature as part of its HTTP requests for web sites supporting it (only affects incremental crawls). To turn that off, use GenericHttpFetcherConfig.setDisableIfModifiedSince(boolean).
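A minimal sketch turning several of these directives off (use sparingly, since respecting them is usually the right thing to do):

import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public class RobotDirectivesSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Ignore robots.txt rules and robots meta / X-Robots-Tag directives.
        cfg.setIgnoreRobotsTxt(true);
        cfg.setIgnoreRobotsMeta(true);

        // Do not auto-detect sitemap.xml files for processed URLs.
        cfg.setIgnoreSitemap(true);

        // Crawl pages even when they point to a different canonical URL.
        cfg.setIgnoreCanonicalLinks(true);
    }
}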
Re-crawl Frequency
The crawler will crawl any given URL at most one time per crawling session. It is possible to skip documents that are not yet "ready" to be re-crawled, to speed up each crawling session. Sitemap.xml directives to that effect are respected by default ("frequency" and "lastmod"). You can define your own re-crawl conditions with setRecrawlableResolver(IRecrawlableResolver). This feature can be used, for instance, to crawl the "news" section of your site more frequently than, let's say, its "archive" section.
Change Detection (Checksums)
To find out if a document has changed from one crawling session to another, the crawler creates and keeps a digital signature, or checksum, of each crawled document. Upon crawling the same URL again, a new checksum is created and compared against the previous one. Any difference indicates a modified document. There are two checksums at play, tested at different times: one obtained from a document's metadata (default is LastModifiedMetadataChecksummer) and one from the document itself (default is MD5DocumentChecksummer). You can provide your own implementations. See CrawlerConfig.setMetadataChecksummer(IMetadataChecksummer) and CrawlerConfig.setDocumentChecksummer(IDocumentChecksummer).
Deduplication
EXPERIMENTAL: The crawler can attempt to detect and reject documents considered duplicates within a crawler session. A document is considered a duplicate if a document with the same metadata or document checksum has already been processed. To enable this feature, set CrawlerConfig.setMetadataDeduplicate(boolean) and/or CrawlerConfig.setDocumentDeduplicate(boolean) to true. Setting those will have no effect if the corresponding checksummers are not set (null).
Deduplication can impact crawl performance. It is recommended you use it only if you can't distinguish duplicates via other means (URL normalizer, canonical URL support, etc.). Also, you should only enable this feature if you know your checksummer(s) will generate a checksum that is acceptably unique to you.
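A minimal sketch enabling checksum-based deduplication. The CrawlerConfig setters are documented on this page; the package locations of the two default checksummer classes and their no-argument constructors are assumptions.

// Assumed package locations for the default checksummers:
import com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer;
import com.norconex.collector.http.checksum.impl.LastModifiedMetadataChecksummer;
import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public class DeduplicationSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Deduplication has no effect unless the matching checksummers are set.
        cfg.setMetadataChecksummer(new LastModifiedMetadataChecksummer());
        cfg.setDocumentChecksummer(new MD5DocumentChecksummer());

        // EXPERIMENTAL: reject documents whose metadata or document checksum
        // was already seen in the same crawler session.
        cfg.setMetadataDeduplicate(true);
        cfg.setDocumentDeduplicate(true);
    }
}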
URL Extraction
To be able to crawl a web site, links need to be extracted from web pages. That is the job of a link extractor. It is possible to use multiple link extractors for different types of content. By default, the HtmlLinkExtractor is used, but you can add others or provide your own with setLinkExtractors(List).
There might be cases where you want a document to be parsed by the Importer and to establish yourself which links to process during the importing phase (for more advanced use cases). In such cases, you can identify a document metadata field to use as a URL holding tank after importing has occurred. URLs in that field will become eligible for crawling. See setPostImportLinks(TextMatcher).
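As an illustration, a sketch that queues URLs found in a hypothetical "myPageLinks" metadata field after importing. The field name is purely illustrative, and the TextMatcher package and its basic(String) factory are assumptions (it comes from Norconex Commons Lang, not from this module).

import com.norconex.collector.http.crawler.HttpCrawlerConfig;
// Assumed package and factory method for TextMatcher (Norconex Commons Lang):
import com.norconex.commons.lang.text.TextMatcher;

public class PostImportLinksSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // "myPageLinks" is a hypothetical field your Importer configuration
        // would populate with URLs to consider for crawling.
        cfg.setPostImportLinks(TextMatcher.basic("myPageLinks"));

        // Keep that field on the document instead of deleting it once
        // its URLs have been queued.
        cfg.setPostImportLinksKeep(true);
    }
}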
XML configuration usage:
<crawler id="(crawler unique identifier)">
  <startURLs
      stayOnDomain="[false|true]"
      includeSubdomains="[false|true]"
      stayOnPort="[false|true]"
      stayOnProtocol="[false|true]"
      async="[false|true]">
    <!-- All the following tags are repeatable. -->
    <url>(a URL)</url>
    <urlsFile>(local path to a file containing URLs)</urlsFile>
    <sitemap>(URL to a sitemap XML)</sitemap>
    <provider class="(IStartURLsProvider implementation)"/>
  </startURLs>
  <urlNormalizers>
    <urlNormalizer class="(IURLNormalizer implementation)"/>
  </urlNormalizers>
  <delay class="(IDelayResolver implementation)"/>
  <maxDepth>(maximum crawl depth)</maxDepth>
  <keepDownloads>[false|true]</keepDownloads>
  <keepReferencedLinks>[INSCOPE|OUTSCOPE|MAXDEPTH]</keepReferencedLinks>
  <fetchHttpHead>[DISABLED|REQUIRED|OPTIONAL]</fetchHttpHead>
  <fetchHttpGet>[REQUIRED|DISABLED|OPTIONAL]</fetchHttpGet>
  <httpFetchers
      maxRetries="(number of times to retry a failed fetch attempt)"
      retryDelay="(how many milliseconds to wait between re-attempting)">
    <!-- Repeatable -->
    <fetcher class="(IHttpFetcher implementation)"/>
  </httpFetchers>
  <robotsTxt ignore="[false|true]" class="(IRobotsTxtProvider implementation)"/>
  <sitemapResolver ignore="[false|true]" class="(ISitemapResolver implementation)"/>
  <recrawlableResolver class="(IRecrawlableResolver implementation)"/>
  <canonicalLinkDetector ignore="[false|true]" class="(ICanonicalLinkDetector implementation)"/>
  <robotsMeta ignore="[false|true]" class="(IRobotsMetaProvider implementation)"/>
  <linkExtractors>
    <!-- Repeatable -->
    <extractor class="(ILinkExtractor implementation)"/>
  </linkExtractors>
  <preImportProcessors>
    <!-- Repeatable -->
    <processor class="(IHttpDocumentProcessor implementation)"/>
  </preImportProcessors>
  <postImportProcessors>
    <!-- Repeatable -->
    <processor class="(IHttpDocumentProcessor implementation)"/>
  </postImportProcessors>
  <postImportLinks keep="[false|true]">
    <fieldMatcher/>
  </postImportLinks>
</crawler>
- Author:
- Pascal Essiembre
-
Nested Class Summary
- static class HttpCrawlerConfig.HttpMethodSupport
- static class HttpCrawlerConfig.ReferencedLinkType
-
Nested classes/interfaces inherited from class com.norconex.collector.core.crawler.CrawlerConfig
CrawlerConfig.OrphansStrategy
-
Constructor Summary
- HttpCrawlerConfig()
-
Method Summary
- boolean equals(Object other)
- ICanonicalLinkDetector getCanonicalLinkDetector(): Gets the canonical link detector.
- IDelayResolver getDelayResolver()
- HttpCrawlerConfig.HttpMethodSupport getFetchHttpGet(): Gets whether to fetch HTTP documents using an HTTP GET request.
- HttpCrawlerConfig.HttpMethodSupport getFetchHttpHead(): Gets whether to fetch HTTP response headers using an HTTP HEAD request.
- List<IHttpFetcher> getHttpFetchers(): Gets HTTP fetchers.
- int getHttpFetchersMaxRetries(): Gets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures.
- long getHttpFetchersRetryDelay(): Gets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds).
- Set<HttpCrawlerConfig.ReferencedLinkType> getKeepReferencedLinks(): Gets what type of referenced links to keep, if any.
- List<ILinkExtractor> getLinkExtractors(): Gets link extractors.
- int getMaxDepth()
- TextMatcher getPostImportLinks(): Gets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
- List<IHttpDocumentProcessor> getPostImportProcessors(): Gets post-import processors.
- List<IHttpDocumentProcessor> getPreImportProcessors(): Gets pre-import processors.
- IRecrawlableResolver getRecrawlableResolver(): Gets the recrawlable resolver.
- IRobotsMetaProvider getRobotsMetaProvider()
- IRobotsTxtProvider getRobotsTxtProvider()
- ISitemapResolver getSitemapResolver()
- List<String> getStartSitemapURLs(): Gets sitemap URLs to be used as starting points for crawling.
- List<String> getStartURLs(): Gets URLs to initiate crawling from.
- List<Path> getStartURLsFiles(): Gets the file paths of seed files containing URLs to be used as "start URLs".
- List<IStartURLsProvider> getStartURLsProviders(): Gets the providers of URLs used as starting points for crawling.
- URLCrawlScopeStrategy getURLCrawlScopeStrategy(): Gets the strategy to use to determine if a URL is in scope.
- IURLNormalizer getUrlNormalizer(): Deprecated, for removal: this API element is subject to removal in a future version. Since 3.1.0, use getUrlNormalizers() instead.
- List<IURLNormalizer> getUrlNormalizers(): Gets URL normalizers.
- int hashCode()
- boolean isFetchHttpHead(): Deprecated. Use getFetchHttpHead().
- boolean isIgnoreCanonicalLinks(): Whether canonical links found in HTTP headers and in the <head> section of HTML files should be ignored or processed.
- boolean isIgnoreRobotsMeta()
- boolean isIgnoreRobotsTxt()
- boolean isIgnoreSitemap(): Whether to ignore sitemap detection and resolving for URLs processed.
- boolean isKeepDownloads()
- boolean isKeepOutOfScopeLinks(): Deprecated. Since 3.0.0, use getKeepReferencedLinks().
- boolean isPostImportLinksKeep(): Gets whether to keep the importer-generated field holding URLs to consider for crawling.
- boolean isStartURLsAsync(): Gets whether the start URLs should be loaded asynchronously.
- protected void loadCrawlerConfigFromXML(XML xml)
- protected void saveCrawlerConfigToXML(XML xml)
- void setCanonicalLinkDetector(ICanonicalLinkDetector canonicalLinkDetector): Sets the canonical link detector.
- void setDelayResolver(IDelayResolver delayResolver)
- void setFetchHttpGet(HttpCrawlerConfig.HttpMethodSupport fetchHttpGet): Sets whether to fetch HTTP documents using an HTTP GET request.
- void setFetchHttpHead(boolean fetchHttpHead): Deprecated.
- void setFetchHttpHead(HttpCrawlerConfig.HttpMethodSupport fetchHttpHead): Sets whether to fetch HTTP response headers using an HTTP HEAD request.
- void setHttpFetchers(IHttpFetcher... httpFetchers): Sets HTTP fetchers.
- void setHttpFetchers(List<IHttpFetcher> httpFetchers): Sets HTTP fetchers.
- void setHttpFetchersMaxRetries(int httpFetchersMaxRetries): Sets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures.
- void setHttpFetchersRetryDelay(long httpFetchersRetryDelay): Sets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds).
- void setIgnoreCanonicalLinks(boolean ignoreCanonicalLinks): Sets whether canonical links found in HTTP headers and in the <head> section of HTML files should be ignored or processed.
- void setIgnoreRobotsMeta(boolean ignoreRobotsMeta)
- void setIgnoreRobotsTxt(boolean ignoreRobotsTxt)
- void setIgnoreSitemap(boolean ignoreSitemap): Sets whether to ignore sitemap detection and resolving for URLs processed.
- void setKeepDownloads(boolean keepDownloads)
- void setKeepOutOfScopeLinks(boolean keepOutOfScopeLinks): Deprecated. Since 3.0.0, use setKeepReferencedLinks(Set).
- void setKeepReferencedLinks(HttpCrawlerConfig.ReferencedLinkType... keepReferencedLinks): Sets whether to keep referenced links and what to keep.
- void setKeepReferencedLinks(Set<HttpCrawlerConfig.ReferencedLinkType> keepReferencedLinks): Sets whether to keep referenced links and what to keep.
- void setLinkExtractors(ILinkExtractor... linkExtractors): Sets link extractors.
- void setLinkExtractors(List<ILinkExtractor> linkExtractors): Sets link extractors.
- void setMaxDepth(int depth)
- void setPostImportLinks(TextMatcher fieldMatcher): Sets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
- void setPostImportLinksKeep(boolean postImportLinksKeep): Sets whether to keep the importer-generated field holding URLs to consider for crawling.
- void setPostImportProcessors(IHttpDocumentProcessor... postImportProcessors): Sets post-import processors.
- void setPostImportProcessors(List<IHttpDocumentProcessor> postImportProcessors): Sets post-import processors.
- void setPreImportProcessors(IHttpDocumentProcessor... preImportProcessors): Sets pre-import processors.
- void setPreImportProcessors(List<IHttpDocumentProcessor> preImportProcessors): Sets pre-import processors.
- void setRecrawlableResolver(IRecrawlableResolver recrawlableResolver): Sets the recrawlable resolver.
- void setRobotsMetaProvider(IRobotsMetaProvider robotsMetaProvider)
- void setRobotsTxtProvider(IRobotsTxtProvider robotsTxtProvider)
- void setSitemapResolver(ISitemapResolver sitemapResolver)
- void setStartSitemapURLs(String... startSitemapURLs): Sets the sitemap URLs used as starting points for crawling.
- void setStartSitemapURLs(List<String> startSitemapURLs): Sets the sitemap URLs used as starting points for crawling.
- void setStartURLs(String... startURLs): Sets URLs to initiate crawling from.
- void setStartURLs(List<String> startURLs): Sets URLs to initiate crawling from.
- void setStartURLsAsync(boolean asyncStartURLs): Sets whether the start URLs should be loaded asynchronously.
- void setStartURLsFiles(Path... startURLsFiles): Sets the file paths of seed files containing URLs to be used as "start URLs".
- void setStartURLsFiles(List<Path> startURLsFiles): Sets the file paths of seed files containing URLs to be used as "start URLs".
- void setStartURLsProviders(IStartURLsProvider... startURLsProviders): Sets the providers of URLs used as starting points for crawling.
- void setStartURLsProviders(List<IStartURLsProvider> startURLsProviders): Sets the providers of URLs used as starting points for crawling.
- void setUrlCrawlScopeStrategy(URLCrawlScopeStrategy urlCrawlScopeStrategy): Sets the strategy to use to determine if a URL is in scope.
- void setUrlNormalizer(IURLNormalizer urlNormalizer): Deprecated, for removal: this API element is subject to removal in a future version. Since 3.1.0, use setUrlNormalizers(List) instead.
- void setUrlNormalizers(List<IURLNormalizer> urlNormalizers): Sets URL normalizers.
- String toString()
-
Methods inherited from class com.norconex.collector.core.crawler.CrawlerConfig
addEventListeners, addEventListeners, clearEventListeners, getCommitter, getCommitters, getDataStoreEngine, getDocumentChecksummer, getDocumentFilters, getEventListeners, getId, getImporterConfig, getMaxDocuments, getMetadataChecksummer, getMetadataFilters, getNumThreads, getOrphansStrategy, getReferenceFilters, getSpoiledReferenceStrategizer, getStopOnExceptions, isDocumentDeduplicate, isMetadataDeduplicate, loadFromXML, saveToXML, setCommitter, setCommitters, setCommitters, setDataStoreEngine, setDocumentChecksummer, setDocumentDeduplicate, setDocumentFilters, setDocumentFilters, setEventListeners, setEventListeners, setId, setImporterConfig, setMaxDocuments, setMetadataChecksummer, setMetadataDeduplicate, setMetadataFilters, setMetadataFilters, setNumThreads, setOrphansStrategy, setReferenceFilters, setReferenceFilters, setSpoiledReferenceStrategizer, setStopOnExceptions, setStopOnExceptions
-
Method Detail
-
isFetchHttpHead
@Deprecated public boolean isFetchHttpHead()
Deprecated. Use getFetchHttpHead().
- Returns:
- true if fetching HTTP response headers separately
- Since:
- 3.0.0-M1
-
setFetchHttpHead
@Deprecated public void setFetchHttpHead(boolean fetchHttpHead)
Deprecated.
- Parameters:
- fetchHttpHead - true if fetching HTTP response headers separately
- Since:
- 3.0.0-M1
-
getFetchHttpHead
public HttpCrawlerConfig.HttpMethodSupport getFetchHttpHead()
Gets whether to fetch HTTP response headers using an HTTP HEAD request. That HTTP request is performed separately from a document download request (HTTP "GET"). Useful when you need to filter documents based on HTTP header values, without downloading them first (e.g., to save bandwidth). When dealing with small documents on average, it may be best to avoid issuing two requests when a single one could do it.
HttpCrawlerConfig.HttpMethodSupport.DISABLED by default. See class documentation for more details.
- Returns:
- HTTP HEAD method support
- Since:
- 3.0.0
-
setFetchHttpHead
public void setFetchHttpHead(HttpCrawlerConfig.HttpMethodSupport fetchHttpHead)
Sets whether to fetch HTTP response headers using an HTTP HEAD request.
See class documentation for more details.
- Parameters:
- fetchHttpHead - HTTP HEAD method support
- Since:
- 3.0.0
-
getFetchHttpGet
public HttpCrawlerConfig.HttpMethodSupport getFetchHttpGet()
Gets whether to fetch HTTP documents using an HTTP GET request. Requests made using the HTTP GET method are usually required to download a document and have its content extracted and links discovered. It should never be disabled unless you have an exceptional use case.
HttpCrawlerConfig.HttpMethodSupport.REQUIRED by default. See class documentation for more details.
- Returns:
- HTTP GET method support
- Since:
- 3.0.0
-
setFetchHttpGet
public void setFetchHttpGet(HttpCrawlerConfig.HttpMethodSupport fetchHttpGet)
Sets whether to fetch HTTP documents using an HTTP GET request. Requests made using the HTTP GET method are usually required to download a document and have its content extracted and links discovered. It should never be disabled unless you have an exceptional use case.
See class documentation for more details.
- Parameters:
- fetchHttpGet - HTTP GET method support
- Since:
- 3.0.0
-
getStartURLs
public List<String> getStartURLs()
Gets URLs to initiate crawling from.- Returns:
- start URLs (never
null
)
-
setStartURLs
public void setStartURLs(String... startURLs)
Sets URLs to initiate crawling from.- Parameters:
startURLs
- start URLs
-
setStartURLs
public void setStartURLs(List<String> startURLs)
Sets URLs to initiate crawling from.- Parameters:
startURLs
- start URLs- Since:
- 3.0.0
-
getStartURLsFiles
public List<Path> getStartURLsFiles()
Gets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.- Returns:
- file paths of seed files containing URLs
(never
null
) - Since:
- 2.3.0
-
setStartURLsFiles
public void setStartURLsFiles(Path... startURLsFiles)
Sets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.- Parameters:
startURLsFiles
- file paths of seed files containing URLs- Since:
- 2.3.0
-
setStartURLsFiles
public void setStartURLsFiles(List<Path> startURLsFiles)
Sets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.- Parameters:
startURLsFiles
- file paths of seed files containing URLs- Since:
- 3.0.0
-
getStartSitemapURLs
public List<String> getStartSitemapURLs()
Gets sitemap URLs to be used as starting points for crawling.- Returns:
- sitemap URLs (never
null
) - Since:
- 2.3.0
-
setStartSitemapURLs
public void setStartSitemapURLs(String... startSitemapURLs)
Sets the sitemap URLs used as starting points for crawling.- Parameters:
startSitemapURLs
- sitemap URLs- Since:
- 2.3.0
-
setStartSitemapURLs
public void setStartSitemapURLs(List<String> startSitemapURLs)
Sets the sitemap URLs used as starting points for crawling.- Parameters:
startSitemapURLs
- sitemap URLs- Since:
- 3.0.0
-
getStartURLsProviders
public List<IStartURLsProvider> getStartURLsProviders()
Gets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
- Returns:
- start URL providers (never
null
) - Since:
- 2.7.0
-
setStartURLsProviders
public void setStartURLsProviders(IStartURLsProvider... startURLsProviders)
Sets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
- Parameters:
- startURLsProviders - start URL providers
- Since:
- 2.7.0
-
setStartURLsProviders
public void setStartURLsProviders(List<IStartURLsProvider> startURLsProviders)
Sets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
- Parameters:
- startURLsProviders - start URL providers
- Since:
- 3.0.0
-
isStartURLsAsync
public boolean isStartURLsAsync()
Gets whether the start URLs should be loaded asynchronously. When true, the crawler will start processing URLs in the queue even if start URLs are still being loaded. While this may speed up crawling, it may have an unexpected effect on the accuracy of HttpDocMetadata.DEPTH. Use of this option is only recommended when start URLs take a significant time to load (e.g., large sitemaps).
- Returns:
- true if async
- Since:
- 3.0.0
-
setStartURLsAsync
public void setStartURLsAsync(boolean asyncStartURLs)
Sets whether the start URLs should be loaded asynchronously. When true, the crawler will start processing URLs in the queue even if start URLs are still being loaded. While this may speed up crawling, it may have an unexpected effect on the accuracy of HttpDocMetadata.DEPTH. Use of this option is only recommended when start URLs take a significant time to load (e.g., large sitemaps).
- Parameters:
- asyncStartURLs - true if async
- Since:
- 3.0.0
-
setMaxDepth
public void setMaxDepth(int depth)
-
getMaxDepth
public int getMaxDepth()
-
getHttpFetchers
public List<IHttpFetcher> getHttpFetchers()
Gets HTTP fetchers.
- Returns:
- HTTP fetchers (never null)
- Since:
- 3.0.0
-
setHttpFetchers
public void setHttpFetchers(IHttpFetcher... httpFetchers)
Sets HTTP fetchers.- Parameters:
httpFetchers
- list of HTTP fetchers- Since:
- 3.0.0
-
setHttpFetchers
public void setHttpFetchers(List<IHttpFetcher> httpFetchers)
Sets HTTP fetchers.- Parameters:
httpFetchers
- list of HTTP fetchers- Since:
- 3.0.0
-
getHttpFetchersMaxRetries
public int getHttpFetchersMaxRetries()
Gets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures. Default is zero (won't retry).- Returns:
- number of times
- Since:
- 3.0.0
-
setHttpFetchersMaxRetries
public void setHttpFetchersMaxRetries(int httpFetchersMaxRetries)
Sets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures.- Parameters:
httpFetchersMaxRetries
- maximum number of retries- Since:
- 3.0.0
-
getHttpFetchersRetryDelay
public long getHttpFetchersRetryDelay()
Gets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds). Default is zero (no delay).- Returns:
- retry delay
- Since:
- 3.0.0
-
setHttpFetchersRetryDelay
public void setHttpFetchersRetryDelay(long httpFetchersRetryDelay)
Sets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds).- Parameters:
httpFetchersRetryDelay
- retry delay- Since:
- 3.0.0
-
getCanonicalLinkDetector
public ICanonicalLinkDetector getCanonicalLinkDetector()
Gets the canonical link detector.- Returns:
- the canonical link detector, or
null
if none are defined. - Since:
- 2.2.0
-
setCanonicalLinkDetector
public void setCanonicalLinkDetector(ICanonicalLinkDetector canonicalLinkDetector)
Sets the canonical link detector. To disable canonical link detection, either pass a null argument, or invoke setIgnoreCanonicalLinks(boolean) with a true value.
- Parameters:
- canonicalLinkDetector - the canonical link detector
- Since:
- 2.2.0
-
getLinkExtractors
public List<ILinkExtractor> getLinkExtractors()
Gets link extractors.- Returns:
- link extractors
-
setLinkExtractors
public void setLinkExtractors(ILinkExtractor... linkExtractors)
Sets link extractors.- Parameters:
linkExtractors
- link extractors
-
setLinkExtractors
public void setLinkExtractors(List<ILinkExtractor> linkExtractors)
Sets link extractors.- Parameters:
linkExtractors
- link extractors- Since:
- 3.0.0
-
getRobotsTxtProvider
public IRobotsTxtProvider getRobotsTxtProvider()
-
setRobotsTxtProvider
public void setRobotsTxtProvider(IRobotsTxtProvider robotsTxtProvider)
-
getUrlNormalizer
@Deprecated(forRemoval=true, since="3.1.0") public IURLNormalizer getUrlNormalizer()
Deprecated, for removal: This API element is subject to removal in a future version. Since 3.1.0, use getUrlNormalizers() instead.
- Returns:
- URL normalizer
-
setUrlNormalizer
@Deprecated(forRemoval=true, since="3.1.0") public void setUrlNormalizer(IURLNormalizer urlNormalizer)
Deprecated, for removal: This API element is subject to removal in a future version. Since 3.1.0, use setUrlNormalizers(List) instead.
- Parameters:
urlNormalizer
- URL normalizer
-
getUrlNormalizers
public List<IURLNormalizer> getUrlNormalizers()
Gets URL normalizers. Defaults to a single GenericURLNormalizer instance (with its default configuration).
- Returns:
- URL normalizers, or an empty list (never null)
- Since:
- 3.1.0
-
setUrlNormalizers
public void setUrlNormalizers(List<IURLNormalizer> urlNormalizers)
Sets URL normalizers.- Parameters:
urlNormalizers
- URL normalizers- Since:
- 3.1.0
-
getDelayResolver
public IDelayResolver getDelayResolver()
-
setDelayResolver
public void setDelayResolver(IDelayResolver delayResolver)
-
getPreImportProcessors
public List<IHttpDocumentProcessor> getPreImportProcessors()
Gets pre-import processors.- Returns:
- pre-import processors
-
setPreImportProcessors
public void setPreImportProcessors(IHttpDocumentProcessor... preImportProcessors)
Sets pre-import processors.- Parameters:
preImportProcessors
- pre-import processors
-
setPreImportProcessors
public void setPreImportProcessors(List<IHttpDocumentProcessor> preImportProcessors)
Sets pre-import processors.- Parameters:
preImportProcessors
- pre-import processors- Since:
- 3.0.0
-
getPostImportProcessors
public List<IHttpDocumentProcessor> getPostImportProcessors()
Gets post-import processors.- Returns:
- post-import processors
-
setPostImportProcessors
public void setPostImportProcessors(IHttpDocumentProcessor... postImportProcessors)
Sets post-import processors.- Parameters:
postImportProcessors
- post-import processors
-
setPostImportProcessors
public void setPostImportProcessors(List<IHttpDocumentProcessor> postImportProcessors)
Sets post-import processors.- Parameters:
postImportProcessors
- post-import processors- Since:
- 3.0.0
-
isIgnoreRobotsTxt
public boolean isIgnoreRobotsTxt()
-
setIgnoreRobotsTxt
public void setIgnoreRobotsTxt(boolean ignoreRobotsTxt)
-
isKeepDownloads
public boolean isKeepDownloads()
-
setKeepDownloads
public void setKeepDownloads(boolean keepDownloads)
-
isKeepOutOfScopeLinks
@Deprecated public boolean isKeepOutOfScopeLinks()
Deprecated. Since 3.0.0, use getKeepReferencedLinks().
Whether links not in scope should be stored as metadata under HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE.
- Returns:
- true if keeping URLs not in scope
- Since:
- 2.8.0
-
setKeepOutOfScopeLinks
@Deprecated public void setKeepOutOfScopeLinks(boolean keepOutOfScopeLinks)
Deprecated. Since 3.0.0, use setKeepReferencedLinks(Set).
Sets whether links not in scope should be stored as metadata under HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE.
- Parameters:
- keepOutOfScopeLinks - true if keeping URLs not in scope
- Since:
- 2.8.0
-
getKeepReferencedLinks
public Set<HttpCrawlerConfig.ReferencedLinkType> getKeepReferencedLinks()
Gets what type of referenced links to keep, if any. Those links are URLs extracted by link extractors. See class documentation for more details.- Returns:
- preferences for keeping links
- Since:
- 3.0.0
-
setKeepReferencedLinks
public void setKeepReferencedLinks(Set<HttpCrawlerConfig.ReferencedLinkType> keepReferencedLinks)
Sets whether to keep referenced links and what to keep. Those links are URLs extracted by link extractors. See class documentation for more details.- Parameters:
keepReferencedLinks
- option for keeping links- Since:
- 3.0.0
-
setKeepReferencedLinks
public void setKeepReferencedLinks(HttpCrawlerConfig.ReferencedLinkType... keepReferencedLinks)
Sets whether to keep referenced links and what to keep. Those links are URLs extracted by link extractors. See class documentation for more details.- Parameters:
keepReferencedLinks
- option for keeping links- Since:
- 3.0.0
-
isIgnoreRobotsMeta
public boolean isIgnoreRobotsMeta()
-
setIgnoreRobotsMeta
public void setIgnoreRobotsMeta(boolean ignoreRobotsMeta)
-
getRobotsMetaProvider
public IRobotsMetaProvider getRobotsMetaProvider()
-
setRobotsMetaProvider
public void setRobotsMetaProvider(IRobotsMetaProvider robotsMetaProvider)
-
isIgnoreSitemap
public boolean isIgnoreSitemap()
Whether to ignore sitemap detection and resolving for URLs processed. Sitemaps specified as start URLs (getStartSitemapURLs()) are never ignored.
- Returns:
- true to ignore sitemaps
-
setIgnoreSitemap
public void setIgnoreSitemap(boolean ignoreSitemap)
Sets whether to ignore sitemap detection and resolving for URLs processed. Sitemaps specified as start URLs (getStartSitemapURLs()) are never ignored.
- Parameters:
- ignoreSitemap - true to ignore sitemaps
-
getSitemapResolver
public ISitemapResolver getSitemapResolver()
-
setSitemapResolver
public void setSitemapResolver(ISitemapResolver sitemapResolver)
-
isIgnoreCanonicalLinks
public boolean isIgnoreCanonicalLinks()
Whether canonical links found in HTTP headers and in the <head> section of HTML files should be ignored or processed. When processed (default), pages with a canonical URL pointer in them are not processed.
- Returns:
- true if ignoring canonical links
- Since:
- 2.2.0
-
setIgnoreCanonicalLinks
public void setIgnoreCanonicalLinks(boolean ignoreCanonicalLinks)
Sets whether canonical links found in HTTP headers and in the <head> section of HTML files should be ignored or processed. If true, pages with a canonical URL pointer in them are not rejected.
- Parameters:
- ignoreCanonicalLinks - true if ignoring canonical links
- Since:
- 2.2.0
-
getURLCrawlScopeStrategy
public URLCrawlScopeStrategy getURLCrawlScopeStrategy()
Gets the strategy to use to determine if a URL is in scope.- Returns:
- the strategy
-
setUrlCrawlScopeStrategy
public void setUrlCrawlScopeStrategy(URLCrawlScopeStrategy urlCrawlScopeStrategy)
Sets the strategy to use to determine if a URL is in scope.- Parameters:
urlCrawlScopeStrategy
- strategy to use- Since:
- 2.8.1
-
getRecrawlableResolver
public IRecrawlableResolver getRecrawlableResolver()
Gets the recrawlable resolver.- Returns:
- recrawlable resolver
- Since:
- 2.5.0
-
setRecrawlableResolver
public void setRecrawlableResolver(IRecrawlableResolver recrawlableResolver)
Sets the recrawlable resolver.- Parameters:
recrawlableResolver
- the recrawlable resolver- Since:
- 2.5.0
-
getPostImportLinks
public TextMatcher getPostImportLinks()
Gets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.- Returns:
- field matcher
- Since:
- 3.0.0
-
setPostImportLinks
public void setPostImportLinks(TextMatcher fieldMatcher)
Set a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.- Parameters:
fieldMatcher
- field matcher- Since:
- 3.0.0
-
isPostImportLinksKeep
public boolean isPostImportLinksKeep()
Gets whether to keep the importer-generated field holding URLs to consider for crawling.- Returns:
true
if keeping- Since:
- 3.0.0
-
setPostImportLinksKeep
public void setPostImportLinksKeep(boolean postImportLinksKeep)
Sets whether to keep the importer-generated field holding URLs to consider for crawling.- Parameters:
postImportLinksKeep
-true
if keeping- Since:
- 3.0.0
-
saveCrawlerConfigToXML
protected void saveCrawlerConfigToXML(XML xml)
- Specified by:
saveCrawlerConfigToXML
in classCrawlerConfig
-
loadCrawlerConfigFromXML
protected void loadCrawlerConfigFromXML(XML xml)
- Specified by:
loadCrawlerConfigFromXML
in classCrawlerConfig
-
equals
public boolean equals(Object other)
- Overrides:
equals
in classCrawlerConfig
-
hashCode
public int hashCode()
- Overrides:
hashCode
in classCrawlerConfig
-
toString
public String toString()
- Overrides:
toString
in classCrawlerConfig
-