Class HttpCrawlerConfig

  • All Implemented Interfaces:
    IXMLConfigurable

    public class HttpCrawlerConfig
    extends CrawlerConfig

    HTTP Crawler configuration.

    Start URLs

    Crawling begins with one or more "start" URLs. Multiple start URLs can be defined, in a combination of ways:

    • Normally: URLs provided directly with setStartURLs(List).
    • From a local file: seed files of URLs (one per line) provided with setStartURLsFiles(List).
    • From a sitemap: sitemap URLs provided with setStartSitemapURLs(List).
    • Dynamically: URLs supplied at launch time by one or more IStartURLsProvider set with setStartURLsProviders(List).

    Scope: To limit crawling to specific web domains, and avoid creating many filters to that effect, you can tell the crawler to "stay" within the web site "scope" with setUrlCrawlScopeStrategy(URLCrawlScopeStrategy).
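    A minimal Java sketch (imports omitted throughout these sketches; class names are those referenced in this documentation) seeding the crawler and keeping it on the starting domain. The scope setters are assumed to mirror the stayOnDomain/includeSubdomains attributes shown in the XML template further below:

    HttpCrawlerConfig config = new HttpCrawlerConfig();
    config.setStartURLs("https://example.com/");
    config.setStartURLsFiles(Paths.get("seeds/urls.txt")); // one URL per line

    URLCrawlScopeStrategy scope = new URLCrawlScopeStrategy();
    scope.setStayOnDomain(true);       // stay on example.com (assumed setter)
    scope.setIncludeSubdomains(true);  // also allow blog.example.com, etc. (assumed setter)
    config.setUrlCrawlScopeStrategy(scope);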

    URL Normalization

    Pages on web sites are often referenced using different URL patterns. Such URL variations can fool the crawler into downloading the same document multiple times. To avoid this, URLs are "normalized". That is, they are converted so they are always formulated the same way. By default, the crawler only applies normalization in ways that are semantically equivalent (see GenericURLNormalizer).
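    A default-equivalent sketch, shown only to illustrate where a custom IURLNormalizer would be plugged in (continuing with the config instance from the sketch above):

    List<IURLNormalizer> normalizers = new ArrayList<>();
    normalizers.add(new GenericURLNormalizer()); // default normalizer with its default configuration
    config.setUrlNormalizers(normalizers);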

    Crawl Speed

    Be kind to web sites you crawl. Being too aggressive can be perceived as a cyber-attack by the targeted web site (e.g., DoS attack). This can lead to your crawler being blocked.

    For this reason, the crawler plays nice by default. It will wait a few seconds between each page download, regardless of the maximum number of threads specified or whether pages crawled are on different web sites. This can of course be changed to be as fast as you want. See GenericDelayResolver for changing default options. You can also provide your own "delay resolver" by supplying a class implementing IDelayResolver.
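    A hedged sketch slowing the crawler to roughly one download every five seconds, on the same config instance (the setDefaultDelay(long) setter on GenericDelayResolver is an assumption to verify against your version):

    GenericDelayResolver delay = new GenericDelayResolver();
    delay.setDefaultDelay(5000);      // milliseconds between downloads (assumed setter)
    config.setDelayResolver(delay);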

    Crawl Depth

    The crawl depth represents how many levels from the start URL the crawler goes. From a browser user perspective, it can be seen as the number of link "clicks" required from a start URL in order to get to a specific page. The crawler will keep going deeper for as long as it discovers new URLs that are not rejected by your configuration. This is not always desirable. For instance, a web site could have dynamically generated URLs with infinite possibilities (e.g., dynamically generated web calendars). To avoid infinite crawls, it is recommended to limit the maximum depth to something reasonable for your site with setMaxDepth(int).

    Keeping downloaded files

    Downloaded files are deleted after being processed. Invoke setKeepDownloads(boolean) with true to preserve them. Files will be kept under a new "downloads" folder found under your working directory. Keep in mind this is not a method for cloning a site. Use with caution on large sites as it can quickly fill up the local disk space.
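    Combining the two settings above on the same config instance (values are illustrative):

    config.setMaxDepth(5);           // stop following links beyond 5 "clicks" from a start URL
    config.setKeepDownloads(true);   // keep raw downloads under the "downloads" folder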

    Keeping Referenced Links

    By default the crawler stores, as metadata, URLs extracted from documents that are in scope. Exceptions are pages discovered at the configured maximum depth (setMaxDepth(int)). This can be changed using the setKeepReferencedLinks(Set) method. Changing this setting has no impact on which pages get crawled. Possible options are:

    • INSCOPE: Default. Keep extracted links that are in scope.
    • OUTSCOPE: Keep extracted links that are out of scope.
    • MAXDEPTH: Keep links extracted from pages reached at the maximum depth.

    Orphan documents

    Orphans are valid documents which, on subsequent crawls, can no longer be reached (e.g., they are no longer referenced). This is regardless of whether the file has been deleted at the source. You can tell the crawler how to handle those with CrawlerConfig.setOrphansStrategy(OrphansStrategy). Possible options are:

    • PROCESS: Default. Tries to crawl orphans normally as if they were still reachable by the crawler.
    • IGNORE: Does nothing with orphans (not deleted, not processed).
    • DELETE: Orphans are sent to your Committer for deletion.
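    For example, to send orphans to your Committer(s) for deletion (OrphansStrategy is assumed here to be the enum nested in CrawlerConfig):

    config.setOrphansStrategy(CrawlerConfig.OrphansStrategy.DELETE);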

    Error Handling

    By default the crawler logs exceptions while trying to prevent them from terminating a crawling session. There might be cases where you want the crawler to halt upon encountering some types of exceptions. You can do so with CrawlerConfig.setStopOnExceptions(List).
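    A hedged sketch halting the session on I/O problems (the list element type Class<? extends Exception> is an assumption based on the method name, and the exception classes chosen are purely illustrative):

    List<Class<? extends Exception>> fatal = new ArrayList<>();
    fatal.add(java.io.IOException.class);           // illustrative choice
    fatal.add(java.net.UnknownHostException.class); // illustrative choice
    config.setStopOnExceptions(fatal);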

    Crawler Events

    The crawler fires all kinds of events to notify interested parties of such things as when a document is rejected, imported, committed, etc. You can listen to crawler events using CrawlerConfig.setEventListeners(List).

    Data Store (Cache)

    During and between crawl sessions, the crawler needs to preserve specific information in order to keep track of things such as a queue of document references to process, those already processed, whether a document has been modified since last crawled, caching of document checksums, etc. For this, the crawler uses a database we call a crawl data store engine. The default implementation uses the local file system to store these (see MVStoreDataStoreEngine). While very capable and suitable for most sites, if you need a larger storage system, you can provide your own implementation with CrawlerConfig.setDataStoreEngine(IDataStoreEngine).
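    A default-equivalent sketch, shown only to illustrate where an alternative IDataStoreEngine would be configured (a no-argument constructor for MVStoreDataStoreEngine is assumed):

    config.setDataStoreEngine(new MVStoreDataStoreEngine());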

    Document Importing

    Transforming, enhancing, parsing to extract plain text, and many other document-specific processing activities are handled by the Norconex Importer module. See ImporterConfig for many additional configuration options.

    Bad Documents

    On a fresh crawl, documents that are unreachable or not obtained successfully for some reason are simply logged and ignored. On the other hand, documents that were successfully crawled once and are suddenly failing on a subsequent crawl are considered "spoiled". You can decide whether to grace (retry next time), delete, or ignore those spoiled documents with CrawlerConfig.setSpoiledReferenceStrategizer(ISpoiledReferenceStrategizer).

    Committing Documents

    The last step of a successful processing of a document is to store it in your preferred target repository (or repositories). For this to happen, you have to configure one or more Committers corresponding to your needs or create a custom one. You can have a look at available Committers here: https://opensource.norconex.com/committers/. See CrawlerConfig.setCommitters(List).

    HTTP Fetcher

    To crawl and parse a document, it needs to be downloaded first. This is the role of one or more HTTP fetchers. GenericHttpFetcher is the default implementation and can handle most web sites. There might be cases where a more specialized way of obtaining web resources is needed. For instance, JavaScript-generated web pages are often best handled by web browsers. In such cases you can use the WebDriverHttpFetcher. You can also use setHttpFetchers(List) to supply your own fetcher implementation.
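    A sketch adding a browser-based fetcher alongside the default one (both classes are named above; no-argument constructors and default configurations are assumed):

    config.setHttpFetchers(
            new GenericHttpFetcher(),     // handles most web sites
            new WebDriverHttpFetcher());  // for JavaScript-generated pages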

    HTTP Methods

    A fetcher typically issues an HTTP GET request to obtain a document. There might be cases where you first want to issue a separate HEAD request. One example is to filter documents based on the HTTP HEAD response information, thus possibly avoiding the download of large files you don't want.

    You can tell the crawler how it should handle HTTP GET and HEAD requests using setFetchHttpGet(HttpMethodSupport) and setFetchHttpHead(HttpMethodSupport) respectively. For each, the options are:

    • DISABLED: No HTTP call will be made using that method.
    • OPTIONAL: If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document can still be processed successfully by the other HTTP method. Only relevant when both HEAD and GET are enabled.
    • REQUIRED: If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document will be rejected and won't go any further, even if the other HTTP method was or could have been successful. Only relevant when both HEAD and GET are enabled.

    If you enable only one HTTP method (the default), then specifying OPTIONAL or REQUIRED for it has the same effect. At least one method needs to be enabled for an HTTP request to be attempted. By default, HEAD requests are DISABLED and GET requests are REQUIRED. If you are unsure what settings to use, keep the defaults.
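    For example, to issue a HEAD request first (enabling metadata filtering before download) while still requiring the GET download for documents that pass:

    config.setFetchHttpHead(HttpCrawlerConfig.HttpMethodSupport.OPTIONAL);
    config.setFetchHttpGet(HttpCrawlerConfig.HttpMethodSupport.REQUIRED);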

    Filtering Unwanted Documents

    Without filtering, you would typically crawl many documents you are not interested in. There are different types of filtering offered to you, occurring at different times during the URL crawling process. The sooner in the URL processing life-cycle you filter out a document, the more you improve crawler performance. It may be important for you to understand the differences (a configuration sketch follows this list):

    • Reference filters: The fastest way to exclude a document. The filtering rule applies on the URL, before any HTTP request is made for that URL. Rejected documents are not queued for processing. They are not downloaded (thus no URLs are extracted). The specified "delay" between downloads is not applied (i.e., no delay for rejected documents).
    • Metadata filters: Apply filtering on a document's metadata fields.

      If isFetchHttpHead() returns true, these filters will be invoked after the crawler performs a distinct HTTP HEAD request. It gives you the opportunity to filter documents based on the HTTP HEAD response to potentially save a more expensive HTTP GET request for download (but results in two HTTP requests for valid documents -- HEAD and GET). Filtering occurs before URLs are extracted.

      When isFetchHttpHead() is false, these filters will be invoked on the metadata of the HTTP response obtained from an HTTP GET request (as the document is downloaded). Filtering occurs after URLs are extracted.

    • Document filters: Use when having access to the document itself (and its content) is required to apply filtering. Always triggered after a document is downloaded and after URLs are extracted, but before it is imported (Importer module).
    • Importer filters: The Importer module also offers document filtering options. At that point a document is already downloaded and its links extracted. There are two types of filtering offered by the Importer: before and after document parsing. Use filters before parsing if you need to filter on raw content or want to prevent a more expensive parsing. Use filters after parsing when you need to read the content as plain text.
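    A hedged sketch of the cheapest option, reference filtering, on the same config instance. The setReferenceFilters(...) setter and a single-method IReferenceFilter accepting a URL string are assumptions based on the filter types described above; adapt names to your version:

    IReferenceFilter noPdfs = ref -> !ref.toLowerCase().endsWith(".pdf"); // reject PDFs from the URL alone
    config.setReferenceFilters(Arrays.asList(noPdfs));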

    Robot Directives

    By default, the crawler tries to respect instructions a web site has put in place for the benefit of crawlers. The following popular ones can be turned off or backed by your own implementation:

    • robots.txt rules: setIgnoreRobotsTxt(boolean), setRobotsTxtProvider(IRobotsTxtProvider)
    • robots meta tags and headers: setIgnoreRobotsMeta(boolean), setRobotsMetaProvider(IRobotsMetaProvider)
    • sitemap.xml: setIgnoreSitemap(boolean), setSitemapResolver(ISitemapResolver)
    • canonical links: setIgnoreCanonicalLinks(boolean), setCanonicalLinkDetector(ICanonicalLinkDetector)

    Re-crawl Frequency

    The crawler will crawl any given URL at most one time per crawling session. It is possible to skip documents that are not yet "ready" to be re-crawled to speed up each crawling session. Sitemap.xml directives to that effect are respected by default ("frequency" and "lastmod"). You can apply your own conditions for re-crawl with setRecrawlableResolver(IRecrawlableResolver). This feature can be used, for instance, to crawl a "news" section of your site more frequently than, let's say, an "archive" section of your site.

    Change Detection (Checksums)

    To find out if a document has changed from one crawling session to another, the crawler creates and keeps a digital signature, or checksum, of each crawled document. Upon crawling the same URL again, a new checksum is created and compared against the previous one. Any difference indicates a modified document. There are two checksums at play, tested at different times: one obtained from the document metadata (default is LastModifiedMetadataChecksummer) and one from the document itself (default is MD5DocumentChecksummer). You can provide your own implementations. See CrawlerConfig.setMetadataChecksummer(IMetadataChecksummer) and CrawlerConfig.setDocumentChecksummer(IDocumentChecksummer).
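    A default-equivalent sketch using the two checksummer classes named above (no-argument constructors assumed):

    config.setMetadataChecksummer(new LastModifiedMetadataChecksummer()); // checksum from last-modified metadata
    config.setDocumentChecksummer(new MD5DocumentChecksummer());          // checksum from document content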

    Deduplication

    EXPERIMENTAL: The crawler can attempt to detect and reject documents considered duplicates within a crawler session. A document will be considered a duplicate if a document with the same metadata or document checksum has already been processed. To enable this feature, set CrawlerConfig.setMetadataDeduplicate(boolean) and/or CrawlerConfig.setDocumentDeduplicate(boolean) to true. Setting those will have no effect if the corresponding checksummers are not set (null).

    Deduplication can impact crawl performance. It is recommended you use it only if you can't distinguish duplicates via other means (URL normalizer, canonical URL support, etc.). Also, you should only enable this feature if you know your checksummer(s) will generate a checksum that is acceptably unique to you.
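    To enable both on the same config instance (these calls take effect only if the corresponding checksummers above are set):

    config.setMetadataDeduplicate(true);
    config.setDocumentDeduplicate(true);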

    URL Extraction

    To be able to crawl a web site, links need to be extracted from web pages. This is the job of a link extractor. It is possible to use multiple link extractors for different types of content. By default, the HtmlLinkExtractor is used, but you can add others or provide your own with setLinkExtractors(List).

    There might be cases where you want a document to be parsed by the Importer and establish which links to process yourself during the importing phase (for more advanced use cases). In such cases, you can identify a document metadata field to use as a URL holding tank after importing has occurred. URLs in that field will become eligible for crawling. See setPostImportLinks(TextMatcher).
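    A sketch keeping the default extractor and harvesting post-import links from a hypothetical "my.links" field (the field name is illustrative; TextMatcher.basic(String) is assumed from the Norconex Commons Lang TextMatcher):

    config.setLinkExtractors(new HtmlLinkExtractor());
    config.setPostImportLinks(TextMatcher.basic("my.links")); // hypothetical field populated by the Importer
    config.setPostImportLinksKeep(true);                      // keep the field on the committed document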

    XML configuration usage:

    
    <crawler
        id="(crawler unique identifier)">
      <startURLs
          stayOnDomain="[false|true]"
          includeSubdomains="[false|true]"
          stayOnPort="[false|true]"
          stayOnProtocol="[false|true]"
          async="[false|true]">
        <!-- All the following tags are repeatable. -->
        <url>(a URL)</url>
        <urlsFile>(local path to a file containing URLs)</urlsFile>
        <sitemap>(URL to a sitemap XML)</sitemap>
        <provider
            class="(IStartURLsProvider implementation)"/>
      </startURLs>
      <urlNormalizers>
        <urlNormalizer
            class="(IURLNormalizer implementation)"/>
      </urlNormalizers>
      <delay
          class="(IDelayResolver implementation)"/>
      <maxDepth>(maximum crawl depth)</maxDepth>
      <keepDownloads>[false|true]</keepDownloads>
      <keepReferencedLinks>[INSCOPE|OUTSCOPE|MAXDEPTH]</keepReferencedLinks>
      <fetchHttpHead>[DISABLED|REQUIRED|OPTIONAL]</fetchHttpHead>
      <fetchHttpGet>[REQUIRED|DISABLED|OPTIONAL]</fetchHttpGet>
      <httpFetchers
          maxRetries="(number of times to retry a failed fetch attempt)"
          retryDelay="(how many milliseconds to wait between re-attempting)">
        <!-- Repeatable -->
        <fetcher
            class="(IHttpFetcher implementation)"/>
      </httpFetchers>
      <robotsTxt
          ignore="[false|true]"
          class="(IRobotsMetaProvider implementation)"/>
      <sitemapResolver
          ignore="[false|true]"
          class="(ISitemapResolver implementation)"/>
      <recrawlableResolver
          class="(IRecrawlableResolver implementation)"/>
      <canonicalLinkDetector
          ignore="[false|true]"
          class="(ICanonicalLinkDetector implementation)"/>
      <robotsMeta
          ignore="[false|true]"
          class="(IRobotsMetaProvider implementation)"/>
      <linkExtractors>
        <!-- Repeatable -->
        <extractor
            class="(ILinkExtractor implementation)"/>
      </linkExtractors>
      <preImportProcessors>
        <!-- Repeatable -->
        <processor
            class="(IHttpDocumentProcessor implementation)"/>
      </preImportProcessors>
      <postImportProcessors>
        <!-- Repeatable -->
        <processor
            class="(IHttpDocumentProcessor implementation)"/>
      </postImportProcessors>
      <postImportLinks
          keep="[false|true]">
        <fieldMatcher/>
      </postImportLinks>
    </crawler>
    Author:
    Pascal Essiembre
    • Constructor Detail

      • HttpCrawlerConfig

        public HttpCrawlerConfig()
    • Method Detail

      • isFetchHttpHead

        @Deprecated
        public boolean isFetchHttpHead()
        Deprecated.
        Returns:
        true if fetching HTTP response headers separately
        Since:
        3.0.0-M1
      • setFetchHttpHead

        @Deprecated
        public void setFetchHttpHead​(boolean fetchHttpHead)
        Deprecated.
        Parameters:
        fetchHttpHead - true if fetching HTTP response headers separately
        Since:
        3.0.0-M1
      • getFetchHttpHead

        public HttpCrawlerConfig.HttpMethodSupport getFetchHttpHead()

        Gets whether to fetch HTTP response headers using an HTTP HEAD request. That HTTP request is performed separately from a document download request (HTTP "GET"). Useful when you need to filter documents based on HTTP header values, without downloading them first (e.g., to save bandwidth). When dealing with small documents on average, it may be best to avoid issuing two requests when a single one could do it.

        HttpCrawlerConfig.HttpMethodSupport.DISABLED by default. See class documentation for more details.

        Returns:
        HTTP HEAD method support
        Since:
        3.0.0
      • setFetchHttpHead

        public void setFetchHttpHead​(HttpCrawlerConfig.HttpMethodSupport fetchHttpHead)

        Sets whether to fetch HTTP response headers using an HTTP HEAD request.

        See class documentation for more details.

        Parameters:
        fetchHttpHead - HTTP HEAD method support
        Since:
        3.0.0
      • getFetchHttpGet

        public HttpCrawlerConfig.HttpMethodSupport getFetchHttpGet()

        Gets whether to fetch HTTP documents using an HTTP GET request. Requests made using the HTTP GET method are usually required to download a document and have its content extracted and links discovered. It should never be disabled unless you have an exceptional use case.

        HttpCrawlerConfig.HttpMethodSupport.REQUIRED by default. See class documentation for more details.

        Returns:
        HTTP GET method support
        Since:
        3.0.0
      • setFetchHttpGet

        public void setFetchHttpGet​(HttpCrawlerConfig.HttpMethodSupport fetchHttpGet)

        Sets whether to fetch HTTP documents using an HTTP GET request. Requests made using the HTTP GET method are usually required to download a document and have its content extracted and links discovered. It should never be disabled unless you have an exceptional use case.

        See class documentation for more details.

        Parameters:
        fetchHttpGet - HTTP GET method support
        Since:
        3.0.0
      • getStartURLs

        public List<String> getStartURLs()
        Gets URLs to initiate crawling from.
        Returns:
        start URLs (never null)
      • setStartURLs

        public void setStartURLs​(String... startURLs)
        Sets URLs to initiate crawling from.
        Parameters:
        startURLs - start URLs
      • setStartURLs

        public void setStartURLs​(List<String> startURLs)
        Sets URLs to initiate crawling from.
        Parameters:
        startURLs - start URLs
        Since:
        3.0.0
      • getStartURLsFiles

        public List<Path> getStartURLsFiles()
        Gets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.
        Returns:
        file paths of seed files containing URLs (never null)
        Since:
        2.3.0
      • setStartURLsFiles

        public void setStartURLsFiles​(Path... startURLsFiles)
        Sets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.
        Parameters:
        startURLsFiles - file paths of seed files containing URLs
        Since:
        2.3.0
      • setStartURLsFiles

        public void setStartURLsFiles​(List<Path> startURLsFiles)
        Sets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.
        Parameters:
        startURLsFiles - file paths of seed files containing URLs
        Since:
        3.0.0
      • getStartSitemapURLs

        public List<String> getStartSitemapURLs()
        Gets sitemap URLs to be used as starting points for crawling.
        Returns:
        sitemap URLs (never null)
        Since:
        2.3.0
      • setStartSitemapURLs

        public void setStartSitemapURLs​(String... startSitemapURLs)
        Sets the sitemap URLs used as starting points for crawling.
        Parameters:
        startSitemapURLs - sitemap URLs
        Since:
        2.3.0
      • setStartSitemapURLs

        public void setStartSitemapURLs​(List<String> startSitemapURLs)
        Sets the sitemap URLs used as starting points for crawling.
        Parameters:
        startSitemapURLs - sitemap URLs
        Since:
        3.0.0
      • getStartURLsProviders

        public List<IStartURLsProvider> getStartURLsProviders()
        Gets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
        Returns:
        start URL providers (never null)
        Since:
        2.7.0
      • setStartURLsProviders

        public void setStartURLsProviders​(IStartURLsProvider... startURLsProviders)
        Sets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
        Parameters:
        startURLsProviders - start URL providers
        Since:
        2.7.0
      • setStartURLsProviders

        public void setStartURLsProviders​(List<IStartURLsProvider> startURLsProviders)
        Sets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
        Parameters:
        startURLsProviders - start URL providers
        Since:
        3.0.0
      • isStartURLsAsync

        public boolean isStartURLsAsync()
        Gets whether the start URLs should be loaded asynchronously. When true, the crawler will start processing URLs in the queue even if start URLs are still being loaded. While this may speed up crawling, it may have an unexpected effect on the accuracy of HttpDocMetadata.DEPTH. Use of this option is only recommended when start URLs take a significant time to load (e.g., large sitemaps).
        Returns:
        true if async.
        Since:
        3.0.0
      • setStartURLsAsync

        public void setStartURLsAsync​(boolean asyncStartURLs)
        Sets whether the start URLs should be loaded asynchronously. When true, the crawler will start processing URLs in the queue even if start URLs are still being loaded. While this may speed up crawling, it may have an unexpected effect on the accuracy of HttpDocMetadata.DEPTH. Use of this option is only recommended when start URLs take a significant time to load (e.g., large sitemaps).
        Parameters:
        asyncStartURLs - true if async.
        Since:
        3.0.0
      • setMaxDepth

        public void setMaxDepth​(int depth)
      • getMaxDepth

        public int getMaxDepth()
      • getHttpFetchers

        public List<IHttpFetcher> getHttpFetchers()
        Gets HTTP fetchers.
        Returns:
        HTTP fetchers (never null)
        Since:
        3.0.0
      • setHttpFetchers

        public void setHttpFetchers​(IHttpFetcher... httpFetchers)
        Sets HTTP fetchers.
        Parameters:
        httpFetchers - list of HTTP fetchers
        Since:
        3.0.0
      • setHttpFetchers

        public void setHttpFetchers​(List<IHttpFetcher> httpFetchers)
        Sets HTTP fetchers.
        Parameters:
        httpFetchers - list of HTTP fetchers
        Since:
        3.0.0
      • getHttpFetchersMaxRetries

        public int getHttpFetchersMaxRetries()
        Gets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures. Default is zero (won't retry).
        Returns:
        number of times
        Since:
        3.0.0
      • setHttpFetchersMaxRetries

        public void setHttpFetchersMaxRetries​(int httpFetchersMaxRetries)
        Sets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures.
        Parameters:
        httpFetchersMaxRetries - maximum number of retries
        Since:
        3.0.0
      • getHttpFetchersRetryDelay

        public long getHttpFetchersRetryDelay()
        Gets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds). Default is zero (no delay).
        Returns:
        retry delay
        Since:
        3.0.0
      • setHttpFetchersRetryDelay

        public void setHttpFetchersRetryDelay​(long httpFetchersRetryDelay)
        Sets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds).
        Parameters:
        httpFetchersRetryDelay - retry delay
        Since:
        3.0.0
      • getCanonicalLinkDetector

        public ICanonicalLinkDetector getCanonicalLinkDetector()
        Gets the canonical link detector.
        Returns:
        the canonical link detector, or null if none are defined.
        Since:
        2.2.0
      • setCanonicalLinkDetector

        public void setCanonicalLinkDetector​(ICanonicalLinkDetector canonicalLinkDetector)
        Sets the canonical link detector. To disable canonical link detection, either pass a null argument, or invoke setIgnoreCanonicalLinks(boolean) with a true value.
        Parameters:
        canonicalLinkDetector - the canonical link detector
        Since:
        2.2.0
      • getLinkExtractors

        public List<ILinkExtractor> getLinkExtractors()
        Gets link extractors.
        Returns:
        link extractors
      • setLinkExtractors

        public void setLinkExtractors​(ILinkExtractor... linkExtractors)
        Sets link extractors.
        Parameters:
        linkExtractors - link extractors
      • setLinkExtractors

        public void setLinkExtractors​(List<ILinkExtractor> linkExtractors)
        Sets link extractors.
        Parameters:
        linkExtractors - link extractors
        Since:
        3.0.0
      • setRobotsTxtProvider

        public void setRobotsTxtProvider​(IRobotsTxtProvider robotsTxtProvider)
      • getUrlNormalizers

        public List<IURLNormalizer> getUrlNormalizers()
        Gets URL normalizers. Defaults to a single GenericURLNormalizer instance (with its default configuration).
        Returns:
        URL normalizers or an empty list (never null)
        Since:
        3.1.0
      • setUrlNormalizers

        public void setUrlNormalizers​(List<IURLNormalizer> urlNormalizers)
        Sets URL normalizers.
        Parameters:
        urlNormalizers - URL normalizers
        Since:
        3.1.0
      • setDelayResolver

        public void setDelayResolver​(IDelayResolver delayResolver)
      • getPreImportProcessors

        public List<IHttpDocumentProcessor> getPreImportProcessors()
        Gets pre-import processors.
        Returns:
        pre-import processors
      • setPreImportProcessors

        public void setPreImportProcessors​(IHttpDocumentProcessor... preImportProcessors)
        Sets pre-import processors.
        Parameters:
        preImportProcessors - pre-import processors
      • setPreImportProcessors

        public void setPreImportProcessors​(List<IHttpDocumentProcessor> preImportProcessors)
        Sets pre-import processors.
        Parameters:
        preImportProcessors - pre-import processors
        Since:
        3.0.0
      • getPostImportProcessors

        public List<IHttpDocumentProcessor> getPostImportProcessors()
        Gets post-import processors.
        Returns:
        post-import processors
      • setPostImportProcessors

        public void setPostImportProcessors​(IHttpDocumentProcessor... postImportProcessors)
        Sets post-import processors.
        Parameters:
        postImportProcessors - post-import processors
      • setPostImportProcessors

        public void setPostImportProcessors​(List<IHttpDocumentProcessor> postImportProcessors)
        Sets post-import processors.
        Parameters:
        postImportProcessors - post-import processors
        Since:
        3.0.0
      • isIgnoreRobotsTxt

        public boolean isIgnoreRobotsTxt()
      • setIgnoreRobotsTxt

        public void setIgnoreRobotsTxt​(boolean ignoreRobotsTxt)
      • isKeepDownloads

        public boolean isKeepDownloads()
      • setKeepDownloads

        public void setKeepDownloads​(boolean keepDownloads)
      • getKeepReferencedLinks

        public Set<HttpCrawlerConfig.ReferencedLinkType> getKeepReferencedLinks()
        Gets what type of referenced links to keep, if any. Those links are URLs extracted by link extractors. See class documentation for more details.
        Returns:
        preferences for keeping links
        Since:
        3.0.0
      • setKeepReferencedLinks

        public void setKeepReferencedLinks​(Set<HttpCrawlerConfig.ReferencedLinkType> keepReferencedLinks)
        Sets whether to keep referenced links and what to keep. Those links are URLs extracted by link extractors. See class documentation for more details.
        Parameters:
        keepReferencedLinks - option for keeping links
        Since:
        3.0.0
      • setKeepReferencedLinks

        public void setKeepReferencedLinks​(HttpCrawlerConfig.ReferencedLinkType... keepReferencedLinks)
        Sets whether to keep referenced links and what to keep. Those links are URLs extracted by link extractors. See class documentation for more details.
        Parameters:
        keepReferencedLinks - option for keeping links
        Since:
        3.0.0
      • isIgnoreRobotsMeta

        public boolean isIgnoreRobotsMeta()
      • setIgnoreRobotsMeta

        public void setIgnoreRobotsMeta​(boolean ignoreRobotsMeta)
      • setRobotsMetaProvider

        public void setRobotsMetaProvider​(IRobotsMetaProvider robotsMetaProvider)
      • isIgnoreSitemap

        public boolean isIgnoreSitemap()
        Whether to ignore sitemap detection and resolving for URLs processed. Sitemaps specified as start URLs (getStartSitemapURLs()) are never ignored.
        Returns:
        true to ignore sitemaps
      • setIgnoreSitemap

        public void setIgnoreSitemap​(boolean ignoreSitemap)
        Sets whether to ignore sitemap detection and resolving for URLs processed. Sitemaps specified as start URLs (getStartSitemapURLs()) are never ignored.
        Parameters:
        ignoreSitemap - true to ignore sitemaps
      • setSitemapResolver

        public void setSitemapResolver​(ISitemapResolver sitemapResolver)
      • isIgnoreCanonicalLinks

        public boolean isIgnoreCanonicalLinks()
        Whether canonical links found in HTTP headers and in HTML files' <head> section should be ignored or processed. When processed (default), pages with a canonical URL pointer in them are not processed.
        Returns:
        true if ignoring canonical links
        Since:
        2.2.0
      • setIgnoreCanonicalLinks

        public void setIgnoreCanonicalLinks​(boolean ignoreCanonicalLinks)
        Sets whether canonical links found in HTTP headers and in HTML files' <head> section should be ignored or processed. If true, pages with a canonical URL pointer in them are not processed.
        Parameters:
        ignoreCanonicalLinks - true if ignoring canonical links
        Since:
        2.2.0
      • getURLCrawlScopeStrategy

        public URLCrawlScopeStrategy getURLCrawlScopeStrategy()
        Gets the strategy to use to determine if a URL is in scope.
        Returns:
        the strategy
      • setUrlCrawlScopeStrategy

        public void setUrlCrawlScopeStrategy​(URLCrawlScopeStrategy urlCrawlScopeStrategy)
        Sets the strategy to use to determine if a URL is in scope.
        Parameters:
        urlCrawlScopeStrategy - strategy to use
        Since:
        2.8.1
      • getRecrawlableResolver

        public IRecrawlableResolver getRecrawlableResolver()
        Gets the recrawlable resolver.
        Returns:
        recrawlable resolver
        Since:
        2.5.0
      • setRecrawlableResolver

        public void setRecrawlableResolver​(IRecrawlableResolver recrawlableResolver)
        Sets the recrawlable resolver.
        Parameters:
        recrawlableResolver - the recrawlable resolver
        Since:
        2.5.0
      • getPostImportLinks

        public TextMatcher getPostImportLinks()
        Gets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
        Returns:
        field matcher
        Since:
        3.0.0
      • setPostImportLinks

        public void setPostImportLinks​(TextMatcher fieldMatcher)
        Sets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
        Parameters:
        fieldMatcher - field matcher
        Since:
        3.0.0
      • isPostImportLinksKeep

        public boolean isPostImportLinksKeep()
        Gets whether to keep the importer-generated field holding URLs to consider for crawling.
        Returns:
        true if keeping
        Since:
        3.0.0
      • setPostImportLinksKeep

        public void setPostImportLinksKeep​(boolean postImportLinksKeep)
        Sets whether to keep the importer-generated field holding URLs to consider for crawling.
        Parameters:
        postImportLinksKeep - true if keeping
        Since:
        3.0.0