Class HttpCrawlerConfig

java.lang.Object
com.norconex.collector.core.crawler.CrawlerConfig
com.norconex.collector.http.crawler.HttpCrawlerConfig
All Implemented Interfaces:
IXMLConfigurable

public class HttpCrawlerConfig extends CrawlerConfig

HTTP Crawler configuration.

Start URLs

Crawling begins with one or more "start" URLs. Multiple start URLs can be defined and combined in different ways: as individual URLs, as local files listing URLs, as sitemap URLs, or dynamically through an IStartURLsProvider implementation (see the XML configuration below).

Scope: To limit crawling to specific web domains without having to create many filters to that effect, you can tell the crawler to "stay" within the start URL web site "scope" with setUrlCrawlScopeStrategy(URLCrawlScopeStrategy).
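
As an illustration only (a Java sketch using the setters named on this page; not compiled against the library), start URLs and scope could be configured like this:

```java
// Sketch: start URLs plus a "stay on domain" scope, mirroring the
// stayOnDomain XML attribute shown further below.
HttpCrawlerConfig cfg = new HttpCrawlerConfig();
cfg.setStartURLs("https://example.com/");

URLCrawlScopeStrategy scope = new URLCrawlScopeStrategy();
scope.setStayOnDomain(true); // assumed setter, matching the XML attribute
cfg.setUrlCrawlScopeStrategy(scope);
```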

URL Normalization

Pages on web sites are often referenced using different URL patterns. Such URL variations can fool the crawler into downloading the same document multiple times. To avoid this, URLs are "normalized". That is, they are converted so they are always formulated the same way. By default, the crawler only applies normalization in ways that are semantically equivalent (see GenericURLNormalizer).
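
The idea can be illustrated with the JDK alone (a minimal sketch independent of GenericURLNormalizer, applying only semantically safe transformations): different variants of the same page reduce to one canonical form.

```java
import java.net.URI;

public class NormalizeDemo {
    // Lower-case the scheme and host, drop the default port, and resolve
    // "." and ".." path segments -- all semantically equivalent changes.
    static String normalize(String url) {
        URI u = URI.create(url).normalize(); // collapses "./" and "../"
        String scheme = u.getScheme().toLowerCase();
        String host = u.getHost().toLowerCase();
        int port = u.getPort();
        boolean defaultPort = (port == -1)
                || ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);
        String path = (u.getRawPath() == null || u.getRawPath().isEmpty())
                ? "/" : u.getRawPath();
        return scheme + "://" + host + (defaultPort ? "" : ":" + port) + path
                + (u.getRawQuery() == null ? "" : "?" + u.getRawQuery());
    }

    public static void main(String[] args) {
        // Two variants of the same page normalize to one canonical form:
        System.out.println(normalize("HTTP://Example.com:80/a/./b/../index.html"));
        System.out.println(normalize("http://example.com/a/index.html"));
        // both print: http://example.com/a/index.html
    }
}
```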

Crawl Speed

Be kind to web sites you crawl. Being too aggressive can be perceived as a cyber-attack by the targeted web site (e.g., DoS attack). This can lead to your crawler being blocked.

For this reason, the crawler plays nice by default. It will wait a few seconds between each page download, regardless of the maximum number of threads specified or whether pages crawled are on different web sites. This can of course be changed to be as fast as you want. See GenericDelayResolver for changing the default options. You can also provide your own "delay resolver" by supplying a class implementing IDelayResolver.
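
For instance (a sketch; the setter name is assumed from GenericDelayResolver's own documentation, given an HttpCrawlerConfig cfg):

```java
// Sketch: wait 5 seconds between downloads instead of the default.
GenericDelayResolver delay = new GenericDelayResolver();
delay.setDefaultDelay(5000); // milliseconds (assumed setter name)
cfg.setDelayResolver(delay);
```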

Crawl Depth

The crawl depth represents how many levels from the start URL the crawler goes. From a browser user's perspective, it can be seen as the number of link "clicks" required from a start URL to get to a specific page. The crawler will keep crawling deeper for as long as it discovers new URLs not rejected by your configuration. This is not always desirable. For instance, a web site could have dynamically generated URLs with infinite possibilities (e.g., dynamically generated web calendars). To avoid infinite crawls, it is recommended to limit the maximum depth to something reasonable for your site with setMaxDepth(int).

Keeping downloaded files

Downloaded files are deleted after being processed. Call setKeepDownloads(boolean) with true in order to preserve them. Files will be kept under a new "downloads" folder found under your working directory. Keep in mind this is not a method for cloning a site. Use with caution on large sites as it can quickly fill up the local disk space.

Keeping Referenced Links

By default the crawler stores, as metadata, URLs extracted from documents that are in scope. Exceptions are pages discovered at the configured maximum depth (setMaxDepth(int)). This can be changed using the setKeepReferencedLinks(Set) method. Changing this setting has no effect on which pages get crawled. Possible options are INSCOPE, OUTSCOPE, and MAXDEPTH (see HttpCrawlerConfig.ReferencedLinkType).

Orphan documents

Orphans are valid documents which, on subsequent crawls, can no longer be reached (e.g., they are no longer referenced). This is regardless of whether the file has been deleted at the source. You can tell the crawler how to handle these with CrawlerConfig.setOrphansStrategy(OrphansStrategy). Possible options are:

  • PROCESS: Default. Tries to crawl orphans normally as if they were still reachable by the crawler.
  • IGNORE: Does nothing with orphans (not deleted, not processed).
  • DELETE: Orphans are sent to your Committer for deletion.
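
For example, to have orphans deleted from your target repository (a sketch, given an HttpCrawlerConfig cfg):

```java
// Sketch: orphans are sent to the Committer(s) for deletion.
cfg.setOrphansStrategy(CrawlerConfig.OrphansStrategy.DELETE);
```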

Error Handling

By default the crawler logs exceptions and attempts to prevent them from terminating a crawling session. There might be cases where you want the crawler to halt upon encountering certain types of exceptions. You can do so with CrawlerConfig.setStopOnExceptions(List).

Crawler Events

The crawler fires all kinds of events to notify interested parties of such things as a document being rejected, imported, committed, etc. You can listen to crawler events using CrawlerConfig.setEventListeners(List).

Data Store (Cache)

During and between crawl sessions, the crawler needs to preserve specific information in order to keep track of things such as the queue of document references to process, those already processed, whether a document has been modified since last crawled, cached document checksums, etc. For this, the crawler uses a database we call a crawl data store engine. The default implementation uses the local file system to store these (see MVStoreDataStoreEngine). While very capable and suitable for most sites, if you need a larger storage system, you can provide your own implementation with CrawlerConfig.setDataStoreEngine(IDataStoreEngine).

Document Importing

The process of transforming, enhancing, and parsing documents to extract plain text, along with many other document-specific processing activities, is handled by the Norconex Importer module. See ImporterConfig for many additional configuration options.

Bad Documents

On a fresh crawl, documents that are unreachable or not obtained successfully for some reason are simply logged and ignored. On the other hand, documents that were successfully crawled once and are suddenly failing on a subsequent crawl are considered "spoiled". You can decide whether to grace (retry next time), delete, or ignore those spoiled documents with CrawlerConfig.setSpoiledReferenceStrategizer(ISpoiledReferenceStrategizer).

Committing Documents

The last step of a successful processing of a document is to store it in your preferred target repository (or repositories). For this to happen, you have to configure one or more Committers corresponding to your needs, or create a custom one. You can have a look at available Committers here: https://opensource.norconex.com/committers/. See CrawlerConfig.setCommitters(List).

HTTP Fetcher

To crawl and parse a document, it needs to be downloaded first. This is the role of one or more HTTP Fetchers. GenericHttpFetcher is the default implementation and can handle most web sites. There might be cases where a more specialized way of obtaining web resources is needed. For instance, JavaScript-generated web pages are often best handled by web browsers. In such cases you can use the WebDriverHttpFetcher. You can also use setHttpFetchers(List) to supply your own fetcher implementation.

HTTP Methods

A fetcher typically issues an HTTP GET request to obtain a document. There might be cases where you first want to issue a separate HEAD request. One example is to filter documents based on the HTTP HEAD response information, thus possibly saving downloading large files you don't want.

You can tell the crawler how it should handle HTTP GET and HEAD requests using setFetchHttpGet(HttpMethodSupport) and setFetchHttpHead(HttpMethodSupport) respectively. For each, the options are:

  • DISABLED: No HTTP call will be made using that method.
  • OPTIONAL: If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document can still be processed successfully by the other HTTP method. Only relevant when both HEAD and GET are enabled.
  • REQUIRED: If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document will be rejected and won't go any further, even if the other HTTP method was or could have been successful. Only relevant when both HEAD and GET are enabled.

If you enable only one HTTP method (the default), then specifying OPTIONAL or REQUIRED for it has the same effect. At least one method needs to be enabled for an HTTP request to be attempted. By default HEAD requests are DISABLED and GET requests are REQUIRED. If you are unsure what settings to use, keep the defaults.
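
For example, to filter on HTTP headers when possible without failing sites that reject HEAD requests (a sketch, given an HttpCrawlerConfig cfg):

```java
// Sketch: try a HEAD request first when supported; GET remains required
// to actually download the document.
cfg.setFetchHttpHead(HttpCrawlerConfig.HttpMethodSupport.OPTIONAL);
cfg.setFetchHttpGet(HttpCrawlerConfig.HttpMethodSupport.REQUIRED);
```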

Filtering Unwanted Documents

Without filtering, you would typically crawl many documents you are not interested in. There are different types of filtering offered to you, occurring at different stages of the URL crawling process. The sooner in a URL's processing life-cycle you filter out a document, the more you improve crawler performance. It may be important for you to understand the differences:

  • Reference filters: The fastest way to exclude a document. The filtering rule applies to the URL, before any HTTP request is made for that URL. Rejected documents are not queued for processing. They are not downloaded (thus no URLs are extracted). The specified "delay" between downloads is not applied (i.e., no delay for rejected documents).
  • Metadata filters: Apply filtering on a document's metadata fields.

    If isFetchHttpHead() returns true, these filters will be invoked after the crawler performs a distinct HTTP HEAD request. It gives you the opportunity to filter documents based on the HTTP HEAD response to potentially save a more expensive HTTP GET request for download (but results in two HTTP requests for valid documents -- HEAD and GET). Filtering occurs before URLs are extracted.

    When isFetchHttpHead() is false, these filters will be invoked on the metadata of the HTTP response obtained from an HTTP GET request (as the document is downloaded). Filtering occurs after URLs are extracted.

  • Document filters: Use when having access to the document itself (and its content) is required to apply filtering. Always triggered after a document is downloaded and after URLs are extracted, but before it is imported (Importer module).
  • Importer filters: The Importer module also offers document filtering options. At that point a document is already downloaded and its links extracted. There are two types of filtering offered by the Importer: before and after document parsing. Use filters before parsing if you need to filter on raw content or want to prevent a more expensive parsing. Use filters after parsing when you need to read the content as plain text.
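
As a sketch of the fastest option above (assuming IReferenceFilter is a single-method interface usable as a lambda, and given an HttpCrawlerConfig cfg):

```java
// Sketch: reject PDF URLs before any HTTP request is made for them.
cfg.setReferenceFilters(List.of(
        ref -> !ref.toLowerCase().endsWith(".pdf")));
```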

Robot Directives

By default, the crawler tries to respect instructions a web site has put in place for the benefit of crawlers, such as robots.txt, robots meta tags, canonical links, and sitemaps. Each of these can be turned off or replaced with your own implementation.

Re-crawl Frequency

The crawler will crawl any given URL at most one time per crawling session. It is possible to skip documents that are not yet "ready" to be re-crawled, to speed up each crawling session. Sitemap.xml directives to that effect are respected by default ("frequency" and "lastmod"). You can apply your own conditions for re-crawl with setRecrawlableResolver(IRecrawlableResolver). This feature can be used, for instance, to crawl a "news" section of your site more frequently than, let's say, an "archive" section.

Change Detection (Checksums)

To find out if a document has changed from one crawling session to another, the crawler creates and keeps a digital signature, or checksum, of each crawled document. Upon crawling the same URL again, a new checksum is created and compared against the previous one. Any difference indicates a modified document. Two checksums are at play, tested at different times: one obtained from a document's metadata (default is LastModifiedMetadataChecksummer) and one from the document itself (default is MD5DocumentChecksummer). You can provide your own implementations. See CrawlerConfig.setMetadataChecksummer(IMetadataChecksummer) and CrawlerConfig.setDocumentChecksummer(IDocumentChecksummer).
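
Setting the two defaults named above explicitly would look like this (a sketch, given an HttpCrawlerConfig cfg):

```java
// Sketch: metadata checksum from the Last-Modified header, document
// checksum from an MD5 digest of the content (the stated defaults).
cfg.setMetadataChecksummer(new LastModifiedMetadataChecksummer());
cfg.setDocumentChecksummer(new MD5DocumentChecksummer());
```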

Deduplication

EXPERIMENTAL: The crawler can attempt to detect and reject documents considered duplicates within a crawler session. A document is considered a duplicate if a document with the same metadata or document checksum was already processed. To enable this feature, set CrawlerConfig.setMetadataDeduplicate(boolean) and/or CrawlerConfig.setDocumentDeduplicate(boolean) to true. These settings have no effect if the corresponding checksummers are not set (null).

Deduplication can impact crawl performance. It is recommended you use it only if you can't distinguish duplicates via other means (URL normalizer, canonical URL support, etc.). Also, you should only enable this feature if you know your checksummer(s) will generate a checksum that is acceptably unique to you.
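
Enabling both checks might look like this (a sketch, given an HttpCrawlerConfig cfg):

```java
// Sketch: both flags only take effect when the corresponding
// checksummers are non-null.
cfg.setMetadataDeduplicate(true);
cfg.setDocumentDeduplicate(true);
```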

URL Extraction

To be able to crawl a web site, links need to be extracted from web pages. This is the job of a link extractor. It is possible to use multiple link extractors for different types of content. By default, the HtmlLinkExtractor is used, but you can add others or provide your own with setLinkExtractors(List).

There might be cases where you want a document to be parsed by the Importer and establish which links to process yourself during the importing phase (for more advanced use cases). In such cases, you can identify a document metadata field to use as a URL holding tank after importing has occurred. URLs in that field will become eligible for crawling. See setPostImportLinks(TextMatcher).
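
As a sketch (the field name "myPageLinks" is hypothetical; given an HttpCrawlerConfig cfg):

```java
// Sketch: URLs stored by the Importer in the "myPageLinks" field become
// eligible for crawling; keep the field on the document afterwards.
cfg.setPostImportLinks(TextMatcher.basic("myPageLinks"));
cfg.setPostImportLinksKeep(true);
```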

XML configuration usage:


<crawler
    id="(crawler unique identifier)">
  <startURLs
      stayOnDomain="[false|true]"
      includeSubdomains="[false|true]"
      stayOnPort="[false|true]"
      stayOnProtocol="[false|true]"
      async="[false|true]">
    <!-- All the following tags are repeatable. -->
    <url>(a URL)</url>
    <urlsFile>(local path to a file containing URLs)</urlsFile>
    <sitemap>(URL to a sitemap XML)</sitemap>
    <provider
        class="(IStartURLsProvider implementation)"/>
  </startURLs>
  <urlNormalizers>
    <urlNormalizer
        class="(IURLNormalizer implementation)"/>
  </urlNormalizers>
  <delay
      class="(IDelayResolver implementation)"/>
  <maxDepth>(maximum crawl depth)</maxDepth>
  <keepDownloads>[false|true]</keepDownloads>
  <keepReferencedLinks>[INSCOPE|OUTSCOPE|MAXDEPTH]</keepReferencedLinks>
  <fetchHttpHead>[DISABLED|REQUIRED|OPTIONAL]</fetchHttpHead>
  <fetchHttpGet>[REQUIRED|DISABLED|OPTIONAL]</fetchHttpGet>
  <httpFetchers
      maxRetries="(number of times to retry a failed fetch attempt)"
      retryDelay="(how many milliseconds to wait between re-attempting)">
    <!-- Repeatable -->
    <fetcher
        class="(IHttpFetcher implementation)"/>
  </httpFetchers>
  <robotsTxt
      ignore="[false|true]"
      class="(IRobotsMetaProvider implementation)"/>
  <sitemapResolver
      ignore="[false|true]"
      class="(ISitemapResolver implementation)"/>
  <recrawlableResolver
      class="(IRecrawlableResolver implementation)"/>
  <canonicalLinkDetector
      ignore="[false|true]"
      class="(ICanonicalLinkDetector implementation)"/>
  <robotsMeta
      ignore="[false|true]"
      class="(IRobotsMetaProvider implementation)"/>
  <linkExtractors>
    <!-- Repeatable -->
    <extractor
        class="(ILinkExtractor implementation)"/>
  </linkExtractors>
  <preImportProcessors>
    <!-- Repeatable -->
    <processor
        class="(IHttpDocumentProcessor implementation)"/>
  </preImportProcessors>
  <postImportProcessors>
    <!-- Repeatable -->
    <processor
        class="(IHttpDocumentProcessor implementation)"/>
  </postImportProcessors>
  <postImportLinks
      keep="[false|true]">
    <fieldMatcher/>
  </postImportLinks>
</crawler>
Author:
Pascal Essiembre
  • Constructor Details

    • HttpCrawlerConfig

      public HttpCrawlerConfig()
  • Method Details

    • isFetchHttpHead

      @Deprecated public boolean isFetchHttpHead()
      Deprecated.
      Returns:
      true if fetching HTTP response headers separately
      Since:
      3.0.0-M1
    • setFetchHttpHead

      @Deprecated public void setFetchHttpHead(boolean fetchHttpHead)
      Deprecated.
      Parameters:
      fetchHttpHead - true if fetching HTTP response headers separately
      Since:
      3.0.0-M1
    • getFetchHttpHead

      public HttpCrawlerConfig.HttpMethodSupport getFetchHttpHead()

      Gets whether to fetch HTTP response headers using an HTTP HEAD request. That HTTP request is performed separately from a document download request (HTTP "GET"). Useful when you need to filter documents based on HTTP header values, without downloading them first (e.g., to save bandwidth). When dealing with small documents on average, it may be best to avoid issuing two requests when a single one could do it.

      HttpCrawlerConfig.HttpMethodSupport.DISABLED by default. See class documentation for more details.

      Returns:
      HTTP HEAD method support
      Since:
      3.0.0
    • setFetchHttpHead

      public void setFetchHttpHead(HttpCrawlerConfig.HttpMethodSupport fetchHttpHead)

      Sets whether to fetch HTTP response headers using an HTTP HEAD request.

      See class documentation for more details.

      Parameters:
      fetchHttpHead - HTTP HEAD method support
      Since:
      3.0.0
    • getFetchHttpGet

      public HttpCrawlerConfig.HttpMethodSupport getFetchHttpGet()

      Gets whether to fetch HTTP documents using an HTTP GET request. Requests made using the HTTP GET method are usually required to download a document and have its content extracted and links discovered. It should never be disabled unless you have an exceptional use case.

      HttpCrawlerConfig.HttpMethodSupport.REQUIRED by default. See class documentation for more details.

      Returns:
      HTTP GET method support
      Since:
      3.0.0
    • setFetchHttpGet

      public void setFetchHttpGet(HttpCrawlerConfig.HttpMethodSupport fetchHttpGet)

      Sets whether to fetch HTTP documents using an HTTP GET request. Requests made using the HTTP GET method are usually required to download a document and have its content extracted and links discovered. It should never be disabled unless you have an exceptional use case.

      See class documentation for more details.

      Parameters:
      fetchHttpGet - HTTP GET method support
      Since:
      3.0.0
    • getStartURLs

      public List<String> getStartURLs()
      Gets URLs to initiate crawling from.
      Returns:
      start URLs (never null)
    • setStartURLs

      public void setStartURLs(String... startURLs)
      Sets URLs to initiate crawling from.
      Parameters:
      startURLs - start URLs
    • setStartURLs

      public void setStartURLs(List<String> startURLs)
      Sets URLs to initiate crawling from.
      Parameters:
      startURLs - start URLs
      Since:
      3.0.0
    • getStartURLsFiles

      public List<Path> getStartURLsFiles()
      Gets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.
      Returns:
      file paths of seed files containing URLs (never null)
      Since:
      2.3.0
    • setStartURLsFiles

      public void setStartURLsFiles(Path... startURLsFiles)
      Sets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.
      Parameters:
      startURLsFiles - file paths of seed files containing URLs
      Since:
      2.3.0
    • setStartURLsFiles

      public void setStartURLsFiles(List<Path> startURLsFiles)
      Sets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.
      Parameters:
      startURLsFiles - file paths of seed files containing URLs
      Since:
      3.0.0
    • getStartSitemapURLs

      public List<String> getStartSitemapURLs()
      Gets sitemap URLs to be used as starting points for crawling.
      Returns:
      sitemap URLs (never null)
      Since:
      2.3.0
    • setStartSitemapURLs

      public void setStartSitemapURLs(String... startSitemapURLs)
      Sets the sitemap URLs used as starting points for crawling.
      Parameters:
      startSitemapURLs - sitemap URLs
      Since:
      2.3.0
    • setStartSitemapURLs

      public void setStartSitemapURLs(List<String> startSitemapURLs)
      Sets the sitemap URLs used as starting points for crawling.
      Parameters:
      startSitemapURLs - sitemap URLs
      Since:
      3.0.0
    • getStartURLsProviders

      public List<IStartURLsProvider> getStartURLsProviders()
      Gets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
      Returns:
      start URL providers (never null)
      Since:
      2.7.0
    • setStartURLsProviders

      public void setStartURLsProviders(IStartURLsProvider... startURLsProviders)
      Sets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
      Parameters:
      startURLsProviders - start URL providers
      Since:
      2.7.0
    • setStartURLsProviders

      public void setStartURLsProviders(List<IStartURLsProvider> startURLsProviders)
      Sets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
      Parameters:
      startURLsProviders - start URL providers
      Since:
      3.0.0
    • isStartURLsAsync

      public boolean isStartURLsAsync()
      Gets whether the start URLs should be loaded asynchronously. When true, the crawler will start processing URLs in the queue even if start URLs are still being loaded. While this may speed up crawling, it may have an unexpected effect on the accuracy of HttpDocMetadata.DEPTH. Use of this option is only recommended when start URLs take a significant time to load (e.g., large sitemaps).
      Returns:
      true if async.
      Since:
      3.0.0
    • setStartURLsAsync

      public void setStartURLsAsync(boolean asyncStartURLs)
      Sets whether the start URLs should be loaded asynchronously. When true, the crawler will start processing URLs in the queue even if start URLs are still being loaded. While this may speed up crawling, it may have an unexpected effect on the accuracy of HttpDocMetadata.DEPTH. Use of this option is only recommended when start URLs take a significant time to load (e.g., large sitemaps).
      Parameters:
      asyncStartURLs - true if async.
      Since:
      3.0.0
    • setMaxDepth

      public void setMaxDepth(int depth)
    • getMaxDepth

      public int getMaxDepth()
    • getHttpFetchers

      public List<IHttpFetcher> getHttpFetchers()
      Gets HTTP fetchers.
      Returns:
      HTTP fetchers (never null)
      Since:
      3.0.0
    • setHttpFetchers

      public void setHttpFetchers(IHttpFetcher... httpFetchers)
      Sets HTTP fetchers.
      Parameters:
      httpFetchers - list of HTTP fetchers
      Since:
      3.0.0
    • setHttpFetchers

      public void setHttpFetchers(List<IHttpFetcher> httpFetchers)
      Sets HTTP fetchers.
      Parameters:
      httpFetchers - list of HTTP fetchers
      Since:
      3.0.0
    • getHttpFetchersMaxRetries

      public int getHttpFetchersMaxRetries()
      Gets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures. Default is zero (won't retry).
      Returns:
      number of times
      Since:
      3.0.0
    • setHttpFetchersMaxRetries

      public void setHttpFetchersMaxRetries(int httpFetchersMaxRetries)
      Sets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures.
      Parameters:
      httpFetchersMaxRetries - maximum number of retries
      Since:
      3.0.0
    • getHttpFetchersRetryDelay

      public long getHttpFetchersRetryDelay()
      Gets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds). Default is zero (no delay).
      Returns:
      retry delay
      Since:
      3.0.0
    • setHttpFetchersRetryDelay

      public void setHttpFetchersRetryDelay(long httpFetchersRetryDelay)
      Sets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds).
      Parameters:
      httpFetchersRetryDelay - retry delay
      Since:
      3.0.0
    • getCanonicalLinkDetector

      public ICanonicalLinkDetector getCanonicalLinkDetector()
      Gets the canonical link detector.
      Returns:
      the canonical link detector, or null if none are defined.
      Since:
      2.2.0
    • setCanonicalLinkDetector

      public void setCanonicalLinkDetector(ICanonicalLinkDetector canonicalLinkDetector)
      Sets the canonical link detector. To disable canonical link detection, either pass a null argument, or invoke setIgnoreCanonicalLinks(boolean) with a true value.
      Parameters:
      canonicalLinkDetector - the canonical link detector
      Since:
      2.2.0
    • getLinkExtractors

      public List<ILinkExtractor> getLinkExtractors()
      Gets link extractors.
      Returns:
      link extractors
    • setLinkExtractors

      public void setLinkExtractors(ILinkExtractor... linkExtractors)
      Sets link extractors.
      Parameters:
      linkExtractors - link extractors
    • setLinkExtractors

      public void setLinkExtractors(List<ILinkExtractor> linkExtractors)
      Sets link extractors.
      Parameters:
      linkExtractors - link extractors
      Since:
      3.0.0
    • getRobotsTxtProvider

      public IRobotsTxtProvider getRobotsTxtProvider()
    • setRobotsTxtProvider

      public void setRobotsTxtProvider(IRobotsTxtProvider robotsTxtProvider)
    • getUrlNormalizer

      @Deprecated(forRemoval=true, since="3.1.0") public IURLNormalizer getUrlNormalizer()
      Deprecated, for removal: This API element is subject to removal in a future version.
      Since 3.1.0, use getUrlNormalizers() instead.
      Returns:
      URL normalizer
    • setUrlNormalizer

      @Deprecated(forRemoval=true, since="3.1.0") public void setUrlNormalizer(IURLNormalizer urlNormalizer)
      Deprecated, for removal: This API element is subject to removal in a future version.
      Since 3.1.0, use setUrlNormalizers(List) instead.
      Parameters:
      urlNormalizer - URL normalizer
    • getUrlNormalizers

      public List<IURLNormalizer> getUrlNormalizers()
      Gets URL normalizers. Defaults to a single GenericURLNormalizer instance (with its default configuration).
      Returns:
      URL normalizers or an empty list (never null)
      Since:
      3.1.0
    • setUrlNormalizers

      public void setUrlNormalizers(List<IURLNormalizer> urlNormalizers)
      Sets URL normalizers.
      Parameters:
      urlNormalizers - URL normalizers
      Since:
      3.1.0
    • getDelayResolver

      public IDelayResolver getDelayResolver()
    • setDelayResolver

      public void setDelayResolver(IDelayResolver delayResolver)
    • getPreImportProcessors

      public List<IHttpDocumentProcessor> getPreImportProcessors()
      Gets pre-import processors.
      Returns:
      pre-import processors
    • setPreImportProcessors

      public void setPreImportProcessors(IHttpDocumentProcessor... preImportProcessors)
      Sets pre-import processors.
      Parameters:
      preImportProcessors - pre-import processors
    • setPreImportProcessors

      public void setPreImportProcessors(List<IHttpDocumentProcessor> preImportProcessors)
      Sets pre-import processors.
      Parameters:
      preImportProcessors - pre-import processors
      Since:
      3.0.0
    • getPostImportProcessors

      public List<IHttpDocumentProcessor> getPostImportProcessors()
      Gets post-import processors.
      Returns:
      post-import processors
    • setPostImportProcessors

      public void setPostImportProcessors(IHttpDocumentProcessor... postImportProcessors)
      Sets post-import processors.
      Parameters:
      postImportProcessors - post-import processors
    • setPostImportProcessors

      public void setPostImportProcessors(List<IHttpDocumentProcessor> postImportProcessors)
      Sets post-import processors.
      Parameters:
      postImportProcessors - post-import processors
      Since:
      3.0.0
    • isIgnoreRobotsTxt

      public boolean isIgnoreRobotsTxt()
    • setIgnoreRobotsTxt

      public void setIgnoreRobotsTxt(boolean ignoreRobotsTxt)
    • isKeepDownloads

      public boolean isKeepDownloads()
    • setKeepDownloads

      public void setKeepDownloads(boolean keepDownloads)
    • isKeepOutOfScopeLinks

      @Deprecated public boolean isKeepOutOfScopeLinks()
      Deprecated.
      Since 3.0.0, use getKeepReferencedLinks().
      Gets whether links not in scope should be stored as metadata under HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE.
      Returns:
      true if keeping URLs not in scope.
      Since:
      2.8.0
    • setKeepOutOfScopeLinks

      @Deprecated public void setKeepOutOfScopeLinks(boolean keepOutOfScopeLinks)
      Deprecated.
      Since 3.0.0, use setKeepReferencedLinks(Set).
      Sets whether links not in scope should be stored as metadata under HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE
      Parameters:
      keepOutOfScopeLinks - true if keeping URLs not in scope
      Since:
      2.8.0
    • getKeepReferencedLinks

      public Set<HttpCrawlerConfig.ReferencedLinkType> getKeepReferencedLinks()
      Gets what type of referenced links to keep, if any. Those links are URLs extracted by link extractors. See class documentation for more details.
      Returns:
      preferences for keeping links
      Since:
      3.0.0
    • setKeepReferencedLinks

      public void setKeepReferencedLinks(Set<HttpCrawlerConfig.ReferencedLinkType> keepReferencedLinks)
      Sets whether to keep referenced links and what to keep. Those links are URLs extracted by link extractors. See class documentation for more details.
      Parameters:
      keepReferencedLinks - option for keeping links
      Since:
      3.0.0
    • setKeepReferencedLinks

      public void setKeepReferencedLinks(HttpCrawlerConfig.ReferencedLinkType... keepReferencedLinks)
      Sets whether to keep referenced links and what to keep. Those links are URLs extracted by link extractors. See class documentation for more details.
      Parameters:
      keepReferencedLinks - option for keeping links
      Since:
      3.0.0
    • isIgnoreRobotsMeta

      public boolean isIgnoreRobotsMeta()
    • setIgnoreRobotsMeta

      public void setIgnoreRobotsMeta(boolean ignoreRobotsMeta)
    • getRobotsMetaProvider

      public IRobotsMetaProvider getRobotsMetaProvider()
    • setRobotsMetaProvider

      public void setRobotsMetaProvider(IRobotsMetaProvider robotsMetaProvider)
    • isIgnoreSitemap

      public boolean isIgnoreSitemap()
      Whether to ignore sitemap detection and resolving for URLs processed. Sitemaps specified as start URLs (getStartSitemapURLs()) are never ignored.
      Returns:
      true to ignore sitemaps
    • setIgnoreSitemap

      public void setIgnoreSitemap(boolean ignoreSitemap)
      Sets whether to ignore sitemap detection and resolving for URLs processed. Sitemaps specified as start URLs (getStartSitemapURLs()) are never ignored.
      Parameters:
      ignoreSitemap - true to ignore sitemaps
    • getSitemapResolver

      public ISitemapResolver getSitemapResolver()
    • setSitemapResolver

      public void setSitemapResolver(ISitemapResolver sitemapResolver)
    • isIgnoreCanonicalLinks

      public boolean isIgnoreCanonicalLinks()
      Gets whether canonical links found in HTTP headers and in an HTML file's <head> section should be ignored or processed. When processed (default), pages with a canonical URL pointer in them are not processed.
      Returns:
      true if ignoring canonical links processed.
      Since:
      2.2.0
    • setIgnoreCanonicalLinks

      public void setIgnoreCanonicalLinks(boolean ignoreCanonicalLinks)
      Sets whether canonical links found in HTTP headers and in an HTML file's <head> section should be ignored or processed. When not ignored, pages with a canonical URL pointer in them are not processed.
      Parameters:
      ignoreCanonicalLinks - true if ignoring canonical links
      Since:
      2.2.0
    • getURLCrawlScopeStrategy

      public URLCrawlScopeStrategy getURLCrawlScopeStrategy()
      Gets the strategy to use to determine if a URL is in scope.
      Returns:
      the strategy
    • setUrlCrawlScopeStrategy

      public void setUrlCrawlScopeStrategy(URLCrawlScopeStrategy urlCrawlScopeStrategy)
      Sets the strategy to use to determine if a URL is in scope.
      Parameters:
      urlCrawlScopeStrategy - strategy to use
      Since:
      2.8.1
    • getRecrawlableResolver

      public IRecrawlableResolver getRecrawlableResolver()
      Gets the recrawlable resolver.
      Returns:
      recrawlable resolver
      Since:
      2.5.0
    • setRecrawlableResolver

      public void setRecrawlableResolver(IRecrawlableResolver recrawlableResolver)
      Sets the recrawlable resolver.
      Parameters:
      recrawlableResolver - the recrawlable resolver
      Since:
      2.5.0
    • getPostImportLinks

      public TextMatcher getPostImportLinks()
      Gets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
      Returns:
      field matcher
      Since:
      3.0.0
    • setPostImportLinks

      public void setPostImportLinks(TextMatcher fieldMatcher)
      Set a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
      Parameters:
      fieldMatcher - field matcher
      Since:
      3.0.0
    • isPostImportLinksKeep

      public boolean isPostImportLinksKeep()
      Gets whether to keep the importer-generated field holding URLs to consider for crawling.
      Returns:
      true if keeping
      Since:
      3.0.0
    • setPostImportLinksKeep

      public void setPostImportLinksKeep(boolean postImportLinksKeep)
      Sets whether to keep the importer-generated field holding URLs to consider for crawling.
      Parameters:
      postImportLinksKeep - true if keeping
      Since:
      3.0.0
    • saveCrawlerConfigToXML

      protected void saveCrawlerConfigToXML(XML xml)
      Specified by:
      saveCrawlerConfigToXML in class CrawlerConfig
    • loadCrawlerConfigFromXML

      protected void loadCrawlerConfigFromXML(XML xml)
      Specified by:
      loadCrawlerConfigFromXML in class CrawlerConfig
    • equals

      public boolean equals(Object other)
      Overrides:
      equals in class CrawlerConfig
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class CrawlerConfig
    • toString

      public String toString()
      Overrides:
      toString in class CrawlerConfig