HttpCrawlerConfig (Norconex HTTP Collector 2.9.1 API)

java.lang.Object
- com.norconex.collector.core.crawler.AbstractCrawlerConfig
- - com.norconex.collector.http.crawler.HttpCrawlerConfig

All Implemented Interfaces:

ICrawlerConfig, IXMLConfigurable
```
public class HttpCrawlerConfig
extends AbstractCrawlerConfig
```
HTTP Crawler configuration.

Author:

Pascal Essiembre

Nested Class Summary
- Nested classes/interfaces inherited from interface com.norconex.collector.core.crawler.ICrawlerConfig
  ICrawlerConfig.OrphansStrategy

Constructor Summary

Constructors
Constructor and Description
`HttpCrawlerConfig()`

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`boolean`	`equals(Object other)`
`ICanonicalLinkDetector`	`getCanonicalLinkDetector()` Gets the canonical link detector.
`IDelayResolver`	`getDelayResolver()`
`IHttpDocumentFetcher`	`getDocumentFetcher()`
`IHttpClientFactory`	`getHttpClientFactory()`
`ILinkExtractor[]`	`getLinkExtractors()`
`int`	`getMaxDepth()`
`IMetadataChecksummer`	`getMetadataChecksummer()` Gets the metadata checksummer.
`IHttpMetadataFetcher`	`getMetadataFetcher()`
`IHttpDocumentProcessor[]`	`getPostImportProcessors()`
`IHttpDocumentProcessor[]`	`getPreImportProcessors()`
`IRecrawlableResolver`	`getRecrawlableResolver()` Gets the recrawlable resolver.
`IRedirectURLProvider`	`getRedirectURLProvider()` Gets the redirect URL provider.
`IRobotsMetaProvider`	`getRobotsMetaProvider()`
`IRobotsTxtProvider`	`getRobotsTxtProvider()`
`ISitemapResolverFactory`	`getSitemapResolverFactory()`
`String[]`	`getStartSitemapURLs()` Gets sitemap URLs to be used as starting points for crawling.
`String[]`	`getStartURLs()`
`String[]`	`getStartURLsFiles()` Gets the file paths of seed files containing URLs to be used as "start URLs".
`IStartURLsProvider[]`	`getStartURLsProviders()` Gets the providers of URLs used as starting points for crawling.
`URLCrawlScopeStrategy`	`getURLCrawlScopeStrategy()` Gets the strategy to use to determine if a URL is in scope.
`IURLNormalizer`	`getUrlNormalizer()`
`String`	`getUserAgent()`
`int`	`hashCode()`
`boolean`	`isIgnoreCanonicalLinks()` Whether canonical links found in HTTP headers and in HTML files <head> section should be ignored or processed.
`boolean`	`isIgnoreRobotsMeta()`
`boolean`	`isIgnoreRobotsTxt()`
`boolean`	`isIgnoreSitemap()` Whether to ignore sitemap detection and resolving for URLs processed.
`boolean`	`isKeepDownloads()`
`boolean`	`isKeepMaxDepthLinks()` Gets whether to keep (and extract) links on pages having reached the configured maximum depth.
`boolean`	`isKeepOutOfScopeLinks()` Whether links not in scope should be stored as metadata under `HttpMetadata.COLLECTOR_REFERENCED_URLS_OUT_OF_SCOPE`.
`boolean`	`isSkipMetaFetcherOnBadStatus()` Gets whether to skip metadata fetching activities instead of rejecting a document on bad status.
`protected void`	`loadCrawlerConfigFromXML(XMLConfiguration xml)`
`protected void`	`saveCrawlerConfigToXML(Writer out)`
`void`	`setCanonicalLinkDetector(ICanonicalLinkDetector canonicalLinkDetector)` Sets the canonical link detector.
`void`	`setDelayResolver(IDelayResolver delayResolver)`
`void`	`setDocumentFetcher(IHttpDocumentFetcher documentFetcher)`
`void`	`setHttpClientFactory(IHttpClientFactory httpClientFactory)`
`void`	`setIgnoreCanonicalLinks(boolean ignoreCanonicalLinks)` Sets whether canonical links found in HTTP headers and in HTML files <head> section should be ignored or processed.
`void`	`setIgnoreRobotsMeta(boolean ignoreRobotsMeta)`
`void`	`setIgnoreRobotsTxt(boolean ignoreRobotsTxt)`
`void`	`setIgnoreSitemap(boolean ignoreSitemap)` Sets whether to ignore sitemap detection and resolving for URLs processed.
`void`	`setKeepDownloads(boolean keepDownloads)`
`void`	`setKeepMaxDepthLinks(boolean keepMaxDepthLinks)` Sets whether to keep (and extract) links on pages having reached the configured maximum depth.
`void`	`setKeepOutOfScopeLinks(boolean keepOutOfScopeLinks)` Sets whether links not in scope should be stored as metadata under `HttpMetadata.COLLECTOR_REFERENCED_URLS_OUT_OF_SCOPE`.
`void`	`setLinkExtractors(ILinkExtractor... linkExtractors)`
`void`	`setMaxDepth(int depth)`
`void`	`setMetadataChecksummer(IMetadataChecksummer metadataChecksummer)`
`void`	`setMetadataFetcher(IHttpMetadataFetcher metadataFetcher)`
`void`	`setPostImportProcessors(IHttpDocumentProcessor... httpPostProcessors)`
`void`	`setPreImportProcessors(IHttpDocumentProcessor... httpPreProcessors)`
`void`	`setRecrawlableResolver(IRecrawlableResolver recrawlableResolver)` Sets the recrawlable resolver.
`void`	`setRedirectURLProvider(IRedirectURLProvider redirectURLProvider)` Sets the redirect URL provider.
`void`	`setRobotsMetaProvider(IRobotsMetaProvider robotsMetaProvider)`
`void`	`setRobotsTxtProvider(IRobotsTxtProvider robotsTxtProvider)`
`void`	`setSitemapResolverFactory(ISitemapResolverFactory sitemapResolverFactory)`
`void`	`setSkipMetaFetcherOnBadStatus(boolean skipMetaFetcherOnBadStatus)` Sets whether to skip metadata fetching activities instead of rejecting a document on bad status.
`void`	`setStartSitemapURLs(String... startSitemapURLs)` Sets the sitemap URLs used as starting points for crawling.
`void`	`setStartURLs(String... startURLs)`
`void`	`setStartURLsFiles(String... startURLsFiles)` Sets the file paths of seed files containing URLs to be used as "start URLs".
`void`	`setStartURLsProviders(IStartURLsProvider... startURLsProviders)` Sets the providers of URLs used as starting points for crawling.
`void`	`setUrlCrawlScopeStrategy(URLCrawlScopeStrategy urlCrawlScopeStrategy)` Sets the strategy to use to determine if a URL is in scope.
`void`	`setUrlNormalizer(IURLNormalizer urlNormalizer)`
`void`	`setUserAgent(String userAgent)`
`String`	`toString()`

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - HttpCrawlerConfig
```
public HttpCrawlerConfig()
```
- Method Detail
  - getStartURLs
```
public String[] getStartURLs()
```
  - setStartURLs
```
public void setStartURLs(String... startURLs)
```
  - getStartURLsFiles
```
public String[] getStartURLsFiles()
```
    Gets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.
    
    Returns:
    
    file paths of seed files containing URLs
    
    Since:
    
    2.3.0
  - setStartURLsFiles
```
public void setStartURLsFiles(String... startURLsFiles)
```
    Sets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.
    
    Parameters:
    
    startURLsFiles - file paths of seed files containing URLs
    
    Since:
    
    2.3.0
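    The seed file format described above (one URL per line, with blank lines and # comment lines ignored) can be sketched as follows. This is a minimal, JDK-only illustration and not the collector's own parsing code: the class and method names are hypothetical, and trimming surrounding whitespace before testing a line is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Norconex API.
final class SeedFileSketch {
    private SeedFileSketch() {
    }

    /** Returns the URLs found in the given seed file lines. */
    static List<String> parseSeedLines(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            // Assumption: surrounding whitespace is not significant.
            String trimmed = line.trim();
            // Skip blank lines and comment lines starting with '#'.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            urls.add(trimmed);
        }
        return urls;
    }
}
```

    The lines of an actual seed file could be read with java.nio.file.Files.readAllLines before being handed to such a helper. When using the collector itself, only the file paths are passed to setStartURLsFiles(String...).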
  - getStartSitemapURLs
```
public String[] getStartSitemapURLs()
```
    Gets sitemap URLs to be used as starting points for crawling.
    
    Returns:
    
    sitemap URLs
    
    Since:
    
    2.3.0
  - setStartSitemapURLs
```
public void setStartSitemapURLs(String... startSitemapURLs)
```
    Sets the sitemap URLs used as starting points for crawling.
    
    Parameters:
    
    startSitemapURLs - sitemap URLs
    
    Since:
    
    2.3.0
  - getStartURLsProviders
```
public IStartURLsProvider[] getStartURLsProviders()
```
    Gets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
    
    Returns:
    
    the start URL providers
    
    Since:
    
    2.7.0
  - setStartURLsProviders
```
public void setStartURLsProviders(IStartURLsProvider... startURLsProviders)
```
    Sets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
    
    Parameters:
    
    startURLsProviders - start URL providers
    
    Since:
    
    2.7.0
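    The statement that provider URLs are combined with start URLs supplied through other methods can be pictured with the sketch below. It is JDK-only and purely illustrative: Supplier<Iterator<String>> is a stand-in for a provider and is not the actual IStartURLsProvider interface, and the class and method names are hypothetical. Whether the crawler removes duplicates is not stated here, so the sketch keeps them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper, not part of the Norconex API.
final class StartURLsCombiner {
    private StartURLsCombiner() {
    }

    /** Merges statically configured start URLs with those supplied at launch time. */
    static List<String> combine(
            String[] startURLs, List<Supplier<Iterator<String>>> providers) {
        // Start with the URLs that were configured up front.
        List<String> all = new ArrayList<>(Arrays.asList(startURLs));
        // Each provider is asked for its URLs only when combine() runs.
        for (Supplier<Iterator<String>> provider : providers) {
            Iterator<String> it = provider.get();
            while (it.hasNext()) {
                all.add(it.next());
            }
        }
        return all;
    }
}
```

    Because each supplier is invoked only when combine is called, the URLs it returns can depend on state that is known only at launch time, such as the content of a database or a queue.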
  - setMaxDepth
```
public void setMaxDepth(int depth)
```
  - getMaxDepth
```
public int getMaxDepth()
```
  - getHttpClientFactory
```
public IHttpClientFactory getHttpClientFactory()
```
  - setHttpClientFactory
```
public void setHttpClientFactory(IHttpClientFactory httpClientFactory)
```
  - getDocumentFetcher
```
public IHttpDocumentFetcher getDocumentFetcher()
```
  - setDocumentFetcher
```
public void setDocumentFetcher(IHttpDocumentFetcher documentFetcher)
```
  - getMetadataFetcher
```
public IHttpMetadataFetcher getMetadataFetcher()
```
  - setMetadataFetcher
```
public void setMetadataFetcher(IHttpMetadataFetcher metadataFetcher)
```
  - getCanonicalLinkDetector
```
public ICanonicalLinkDetector getCanonicalLinkDetector()
```
    Gets the canonical link detector.
    
    Returns:
    
    the canonical link detector, or null if none are defined.
    
    Since:
    
    2.2.0
  - setCanonicalLinkDetector
```
public void setCanonicalLinkDetector(ICanonicalLinkDetector canonicalLinkDetector)
```
    Sets the canonical link detector. To disable canonical link detection, either pass a null argument, or invoke setIgnoreCanonicalLinks(boolean) with a true value.
    
    Parameters:
    
    canonicalLinkDetector - the canonical link detector
    
    Since:
    
    2.2.0
  - getLinkExtractors
```
public ILinkExtractor[] getLinkExtractors()
```
  - setLinkExtractors
```
public void setLinkExtractors(ILinkExtractor... linkExtractors)
```
  - getRobotsTxtProvider
```
public IRobotsTxtProvider getRobotsTxtProvider()
```
  - setRobotsTxtProvider
```
public void setRobotsTxtProvider(IRobotsTxtProvider robotsTxtProvider)
```
  - getUrlNormalizer
```
public IURLNormalizer getUrlNormalizer()
```
  - setUrlNormalizer
```
public void setUrlNormalizer(IURLNormalizer urlNormalizer)
```
  - getDelayResolver
```
public IDelayResolver getDelayResolver()
```
  - setDelayResolver
```
public void setDelayResolver(IDelayResolver delayResolver)
```
  - getPreImportProcessors
```
public IHttpDocumentProcessor[] getPreImportProcessors()
```
  - setPreImportProcessors
```
public void setPreImportProcessors(IHttpDocumentProcessor... httpPreProcessors)
```
  - getPostImportProcessors
```
public IHttpDocumentProcessor[] getPostImportProcessors()
```
  - setPostImportProcessors
```
public void setPostImportProcessors(IHttpDocumentProcessor... httpPostProcessors)
```
  - isIgnoreRobotsTxt
```
public boolean isIgnoreRobotsTxt()
```
  - setIgnoreRobotsTxt
```
public void setIgnoreRobotsTxt(boolean ignoreRobotsTxt)
```
  - isKeepDownloads
```
public boolean isKeepDownloads()
```
  - setKeepDownloads
```
public void setKeepDownloads(boolean keepDownloads)
```
  - isKeepMaxDepthLinks
```
public boolean isKeepMaxDepthLinks()
```
    Gets whether to keep (and extract) links on pages having reached the configured maximum depth.
    
    Returns:
    
    true if keeping max depth links.
    
    Since:
    
    2.9.1
    
    See Also:
    
    getMaxDepth()
  - setKeepMaxDepthLinks
```
public void setKeepMaxDepthLinks(boolean keepMaxDepthLinks)
```
    Sets whether to keep (and extract) links on pages having reached the configured maximum depth.
    
    Parameters:
    
    keepMaxDepthLinks - true to keep max depth links.
    
    Since:
    
    2.9.1
    
    See Also:
    
    setMaxDepth(int)
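    The interaction between the maximum depth and this flag can be sketched as a single decision: should links be extracted from a page at a given depth? The code below is a hedged illustration and not the crawler's implementation. The class and method names are hypothetical, and treating a negative maximum depth as "unlimited" is an assumption of this sketch.

```java
// Hypothetical helper, not part of the Norconex API.
final class MaxDepthLinksSketch {
    private MaxDepthLinksSketch() {
    }

    /** Returns true if links on a page at pageDepth should be extracted and kept. */
    static boolean extractLinks(int pageDepth, int maxDepth, boolean keepMaxDepthLinks) {
        // Assumption: a negative maximum depth means no depth limit.
        boolean unlimited = maxDepth < 0;
        if (unlimited || pageDepth < maxDepth) {
            return true;
        }
        // The page has reached the maximum depth: its links would lead
        // beyond the limit, so they are kept only when explicitly requested.
        return keepMaxDepthLinks;
    }
}
```

    With a maximum depth of 2, a page at depth 1 always has its links extracted, while a page at depth 2 has them extracted only when keepMaxDepthLinks is true.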
  - isKeepOutOfScopeLinks
```
public boolean isKeepOutOfScopeLinks()
```
    Whether links not in scope should be stored as metadata under HttpMetadata.COLLECTOR_REFERENCED_URLS_OUT_OF_SCOPE.
    
    Returns:
    
    true if keeping URLs not in scope.
    
    Since:
    
    2.8.0
  - setKeepOutOfScopeLinks
```
public void setKeepOutOfScopeLinks(boolean keepOutOfScopeLinks)
```
    Sets whether links not in scope should be stored as metadata under HttpMetadata.COLLECTOR_REFERENCED_URLS_OUT_OF_SCOPE.
    
    Parameters:
    
    keepOutOfScopeLinks - true if keeping URLs not in scope
    
    Since:
    
    2.8.0
  - getMetadataChecksummer
```
public IMetadataChecksummer getMetadataChecksummer()
```
    Gets the metadata checksummer. Default implementation is LastModifiedMetadataChecksummer (since 2.2.0).
    
    Returns:
    
    metadata checksummer
  - setMetadataChecksummer
```
public void setMetadataChecksummer(IMetadataChecksummer metadataChecksummer)
```
  - isIgnoreRobotsMeta
```
public boolean isIgnoreRobotsMeta()
```
  - setIgnoreRobotsMeta
```
public void setIgnoreRobotsMeta(boolean ignoreRobotsMeta)
```
  - getRobotsMetaProvider
```
public IRobotsMetaProvider getRobotsMetaProvider()
```
  - setRobotsMetaProvider
```
public void setRobotsMetaProvider(IRobotsMetaProvider robotsMetaProvider)
```
  - isIgnoreSitemap
```
public boolean isIgnoreSitemap()
```
    Whether to ignore sitemap detection and resolving for URLs processed. Sitemaps specified as start URLs (getStartSitemapURLs()) are never ignored.
    
    Returns:
    
    true to ignore sitemaps
  - setIgnoreSitemap
```
public void setIgnoreSitemap(boolean ignoreSitemap)
```
    Sets whether to ignore sitemap detection and resolving for URLs processed. Sitemaps specified as start URLs (getStartSitemapURLs()) are never ignored.
    
    Parameters:
    
    ignoreSitemap - true to ignore sitemaps
  - getSitemapResolverFactory
```
public ISitemapResolverFactory getSitemapResolverFactory()
```
  - setSitemapResolverFactory
```
public void setSitemapResolverFactory(ISitemapResolverFactory sitemapResolverFactory)
```
  - getUserAgent
```
public String getUserAgent()
```
  - setUserAgent
```
public void setUserAgent(String userAgent)
```
  - isIgnoreCanonicalLinks
```
public boolean isIgnoreCanonicalLinks()
```
    Whether canonical links found in HTTP headers and in the <head> section of HTML files should be ignored or processed. When processed (default), URL pages with a canonical URL pointer in them are not processed.
    
    Returns:
    
    true if ignoring canonical links
    
    Since:
    
    2.2.0
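    The effect of this flag on a single page can be sketched as below. This is an illustration of the behavior described for isIgnoreCanonicalLinks() and not the crawler's actual code: the class and method names are hypothetical, and treating a page whose canonical URL is its own URL as processable is an assumption of this sketch.

```java
// Hypothetical helper, not part of the Norconex API.
final class CanonicalLinkSketch {
    private CanonicalLinkSketch() {
    }

    /**
     * Returns true if the page itself should be processed.
     * canonicalURL is null when the page declares no canonical link.
     */
    static boolean shouldProcessPage(
            String pageURL, String canonicalURL, boolean ignoreCanonicalLinks) {
        // Canonical links are ignored: every page is processed as-is.
        if (ignoreCanonicalLinks) {
            return true;
        }
        // Assumption: no canonical link, or one pointing at the page
        // itself, leaves the page eligible for processing.
        if (canonicalURL == null || canonicalURL.equals(pageURL)) {
            return true;
        }
        // The page points at a different canonical URL: it is not processed.
        return false;
    }
}
```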
  - setIgnoreCanonicalLinks
```
public void setIgnoreCanonicalLinks(boolean ignoreCanonicalLinks)
```
    Sets whether canonical links found in HTTP headers and in the <head> section of HTML files should be ignored or processed. If false (default), canonical links are processed and URL pages with a canonical URL pointer in them are not processed.
    
    Parameters:
    
    ignoreCanonicalLinks - true if ignoring canonical links
    
    Since:
    
    2.2.0
  - getURLCrawlScopeStrategy
```
public URLCrawlScopeStrategy getURLCrawlScopeStrategy()
```
    Gets the strategy to use to determine if a URL is in scope.
    
    Returns:
    
    the strategy
  - setUrlCrawlScopeStrategy
```
public void setUrlCrawlScopeStrategy(URLCrawlScopeStrategy urlCrawlScopeStrategy)
```
    Sets the strategy to use to determine if a URL is in scope.
    
    Parameters:
    
    urlCrawlScopeStrategy - strategy to use
    
    Since:
    
    2.8.1
  - getRedirectURLProvider
```
public IRedirectURLProvider getRedirectURLProvider()
```
    Gets the redirect URL provider.
    
    Returns:
    
    the redirect URL provider
    
    Since:
    
    2.4.0
  - setRedirectURLProvider
```
public void setRedirectURLProvider(IRedirectURLProvider redirectURLProvider)
```
    Sets the redirect URL provider.
    
    Parameters:
    
    redirectURLProvider - redirect URL provider
    
    Since:
    
    2.4.0
  - getRecrawlableResolver
```
public IRecrawlableResolver getRecrawlableResolver()
```
    Gets the recrawlable resolver.
    
    Returns:
    
    recrawlable resolver
    
    Since:
    
    2.5.0
  - setRecrawlableResolver
```
public void setRecrawlableResolver(IRecrawlableResolver recrawlableResolver)
```
    Sets the recrawlable resolver.
    
    Parameters:
    
    recrawlableResolver - the recrawlable resolver
    
    Since:
    
    2.5.0
  - isSkipMetaFetcherOnBadStatus
```
public boolean isSkipMetaFetcherOnBadStatus()
```
    Gets whether to skip metadata fetching activities instead of rejecting a document on bad status.
    
    Returns:
    
    true if skipping
    
    Since:
    
    2.9.1
  - setSkipMetaFetcherOnBadStatus
```
public void setSkipMetaFetcherOnBadStatus(boolean skipMetaFetcherOnBadStatus)
```
    Sets whether to skip metadata fetching activities instead of rejecting a document on bad status. If true, upon receiving a bad HTTP status code, activities such as metadata filtering, canonical URL resolution and metadata checksum creation are all skipped. When applicable, those activities will be performed after the document fetcher has also had a chance to download metadata. Setting this flag to true can be useful when the HTTP HEAD method is not supported by some sites or pages.
    
    Parameters:
    
    skipMetaFetcherOnBadStatus - true if skipping
    
    Since:
    
    2.9.1
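    The decision this flag controls can be sketched as a small function over the status returned to the metadata fetcher. The sketch is illustrative only and does not reproduce the crawler's code: the class, enum and method names are hypothetical, and treating any status outside the 2xx range as "bad" is an assumption, since the set of accepted status codes depends on the configured metadata fetcher.

```java
// Hypothetical helper, not part of the Norconex API.
final class MetaFetchOutcomeSketch {
    private MetaFetchOutcomeSketch() {
    }

    enum Outcome {
        /** Metadata was fetched: filtering, canonical resolution, checksum proceed. */
        CONTINUE,
        /** Metadata activities are skipped and deferred to the document fetcher. */
        SKIP_METADATA_ACTIVITIES,
        /** The document is rejected. */
        REJECT_DOCUMENT
    }

    static Outcome afterMetadataFetch(int httpStatus, boolean skipMetaFetcherOnBadStatus) {
        // Assumption: only 2xx responses count as a good status.
        boolean badStatus = httpStatus < 200 || httpStatus >= 300;
        if (!badStatus) {
            return Outcome.CONTINUE;
        }
        return skipMetaFetcherOnBadStatus
                ? Outcome.SKIP_METADATA_ACTIVITIES
                : Outcome.REJECT_DOCUMENT;
    }
}
```

    For a site that answers HEAD requests with 405 (Method Not Allowed), the flag turns a rejection into a deferral: the document is still fetched with the document fetcher, and the metadata activities run on what that request returns.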
  - saveCrawlerConfigToXML
```
protected void saveCrawlerConfigToXML(Writer out)
                               throws IOException
```
    Specified by:
    
    saveCrawlerConfigToXML in class AbstractCrawlerConfig
    
    Throws:
    
    IOException
  - loadCrawlerConfigFromXML
```
protected void loadCrawlerConfigFromXML(XMLConfiguration xml)
```
    Specified by:
    
    loadCrawlerConfigFromXML in class AbstractCrawlerConfig
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class AbstractCrawlerConfig
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class AbstractCrawlerConfig
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class AbstractCrawlerConfig
