Class GenericDelayResolver

  • All Implemented Interfaces:
    IDelayResolver, IXMLConfigurable

    public class GenericDelayResolver
    extends AbstractDelayResolver

    Default implementation for creating voluntary delays between URL downloads. The actual delay value is resolved as follows, in order of precedence:

    1. Takes the delay specified by a robots.txt file. Only applicable if robots.txt files and their crawl delays are not ignored.
    2. Takes an explicitly scheduled delay, if any (picks the first one matching).
    3. Uses the specified default delay, or 3 seconds if none is specified.

    In a delay schedule, days of the week are spelled out (in English): Monday, Tuesday, etc. Time ranges use the 24-hour format.
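
    As an illustration of the resolution order and schedule syntax, here is a minimal sketch using only the attributes documented for this class. The delay values and time range are arbitrary, and the two overlapping schedules are an assumption chosen to show the "first one matching" rule:

    ```xml
    <!-- Robots.txt crawl delays are honored (step 1), so any delay found
         there takes precedence over the schedules and the default below. -->
    <delay
        class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
        default="3000"
        ignoreRobotsCrawlDelay="false"
        scope="crawler">
      <!-- Step 2: listed first, so it is picked on weekdays from 08:00 to 18:00. -->
      <schedule
          dayOfWeek="from Monday to Friday"
          time="from 08:00 to 18:00">
        10000
      </schedule>
      <!-- Also matches weekdays, but only applies outside the range above. -->
      <schedule
          dayOfWeek="from Monday to Friday">
        5000
      </schedule>
      <!-- Step 3: on weekends no schedule matches, so the default (3000) is used. -->
    </delay>
    ```

    Because the first matching schedule is used, the more specific entry should be listed before the broader one.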

    One of the following scopes dictates how the delay is applied, listed in order from the best behaved to the least:

    • crawler: the delay is applied between each URL download within a crawler instance, regardless of how many threads are defined within that crawler, or whether URLs are from the same site or not. This is the default scope.
    • site: the delay is applied between each URL download from the same site within a crawler instance, regardless of how many threads are defined. A site is defined by a URL protocol and its domain (e.g. http://example.com).
    • thread: the delay is applied between each URL download from any given thread. The more threads you have, the less of an impact the delay will have.
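
    To make the difference between scopes concrete, the sketch below selects the least restrictive one. The 2000 ms value is an arbitrary assumption, and the number of threads is configured elsewhere in the crawler (not shown here):

    ```xml
    <!-- Each thread waits 2000 ms between its own downloads. With, say,
         4 threads, up to 4 downloads can occur per 2-second window,
         possibly against the same site. -->
    <delay
        class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
        default="2000"
        scope="thread"/>
    ```

    Changing the scope to "site" or "crawler" with the same value would instead cap downloads at one every 2 seconds per site, or per crawler instance, no matter how many threads are running.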

    As of 2.7.0, XML configuration entries expecting millisecond durations can be provided in human-readable format (English only), as per DurationParser (e.g., "5 minutes and 30 seconds" or "5m30s").
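
    For instance, the following sketch uses the human-readable form for the default delay and for a scheduled delay. It assumes version 2.7.0 or later, and the specific durations are illustrative only:

    ```xml
    <!-- "5 minutes and 30 seconds" is the same duration as "5m30s",
         or 330000 when expressed in milliseconds. -->
    <delay
        class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
        default="5 minutes and 30 seconds">
      <schedule
          dayOfWeek="from Saturday to Sunday">
        1 second
      </schedule>
    </delay>
    ```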

    XML configuration usage:

    
    <delay
        class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
        default="(milliseconds)"
        ignoreRobotsCrawlDelay="[false|true]"
        scope="[crawler|site|thread]">
      <schedule
          dayOfWeek="from (week day) to (week day)"
          dayOfMonth="from [1-31] to [1-31]"
          time="from (HH:mm) to (HH:mm)">
        (delay in milliseconds)
      </schedule>
      (... repeat schedule tag as needed ...)
    </delay>

    XML usage example:

    
    <delay
        class="GenericDelayResolver"
        default="5 seconds"
        ignoreRobotsCrawlDelay="true"
        scope="site">
      <schedule
          dayOfWeek="from Saturday to Sunday">
        1 second
      </schedule>
    </delay>

    The above example sets the minimum delay between each document download on a given site to 5 seconds, regardless of what the site's robots.txt may say, except on weekends, when it is more aggressive (1 second).

    Author:
    Pascal Essiembre