Class ReferenceDelayResolver
- java.lang.Object
-
- com.norconex.collector.http.delay.impl.AbstractDelayResolver
-
- com.norconex.collector.http.delay.impl.ReferenceDelayResolver
-
- All Implemented Interfaces:
IDelayResolver, IXMLConfigurable
public class ReferenceDelayResolver extends AbstractDelayResolver
Introduces different delays between document downloads based on matching document reference (URL) patterns. The actual delay value is determined as follows, in order of precedence:
- Takes the delay specified by a robots.txt file. Only applicable if robots.txt files and their crawl delays are not ignored.
- Takes the delay matching a reference pattern, if any (picks the first one matching).
- Uses the specified default delay, or 3 seconds if none is specified.
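The precedence above can be sketched in a few lines of standalone Java. This is an illustration only, not the Norconex implementation: the class `DelayOrderSketch`, its `resolve` method, the use of -1 to mean "no value", and the choice of whole-string regex matching are all assumptions made for the example.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of the resolution order; not the Norconex source.
final class DelayOrderSketch {
    // Fallback used when no default delay is configured (3 seconds).
    static final long FALLBACK_DELAY_MS = 3000L;

    /**
     * @param url            document reference being downloaded
     * @param robotsDelayMs  crawl delay from robots.txt in ms, or -1 if absent or ignored
     * @param patterns       regex -> delay in ms, iterated in insertion order
     * @param defaultDelayMs configured default in ms, or -1 if none was specified
     */
    static long resolve(String url, long robotsDelayMs,
            Map<String, Long> patterns, long defaultDelayMs) {
        // 1. A robots.txt crawl delay wins when present and not ignored.
        if (robotsDelayMs >= 0) {
            return robotsDelayMs;
        }
        // 2. Otherwise the first matching reference pattern applies.
        //    Assumption: the pattern must match the whole reference.
        for (Map.Entry<String, Long> entry : patterns.entrySet()) {
            if (Pattern.matches(entry.getKey(), url)) {
                return entry.getValue();
            }
        }
        // 3. Otherwise the configured default, or 3 seconds.
        return defaultDelayMs >= 0 ? defaultDelayMs : FALLBACK_DELAY_MS;
    }
}
```

Passing an insertion-ordered map such as `LinkedHashMap` keeps the "first one matching" behaviour predictable when several patterns could match the same reference.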
One of the following scopes dictates how the delay is applied, listed in order from the best behaved to the least.
- crawler: the delay is applied between each URL download within a crawler instance, regardless of how many threads are defined within that crawler, or whether URLs are from the same site or not. This is the default scope.
- site: the delay is applied between each URL download from the same site within a crawler instance, regardless of how many threads are defined. A site is defined by a URL protocol and its domain (e.g. http://example.com).
- thread: the delay is applied between each URL download from any given thread. The more threads you have, the less impact the delay will have.
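One way to picture the three scopes is as a grouping key: downloads that share a key wait on one another, while downloads with different keys do not. The sketch below is a hypothetical illustration of that idea, not how Norconex implements scopes; `DelayScopeSketch` and `scopeKey` are names invented for the example.

```java
import java.net.URI;

// Hypothetical illustration of scope grouping; not the Norconex source.
final class DelayScopeSketch {
    /**
     * Returns a key identifying the group of downloads that share a delay.
     *
     * @param scope      one of "crawler", "site" or "thread"
     * @param url        document reference being downloaded
     * @param threadName name of the thread performing the download
     */
    static String scopeKey(String scope, String url, String threadName) {
        switch (scope) {
            case "crawler":
                // One shared key: every download in the crawler is throttled together.
                return "crawler";
            case "site": {
                // Protocol plus domain, e.g. http://example.com
                URI uri = URI.create(url);
                return uri.getScheme() + "://" + uri.getHost();
            }
            case "thread":
                // Each thread is throttled on its own, so more threads means more traffic.
                return "thread:" + threadName;
            default:
                throw new IllegalArgumentException("Unknown scope: " + scope);
        }
    }
}
```

Under the site scope, http://example.com/a and http://example.com/b map to the same key, while a URL on another domain maps to a different key and is throttled independently.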
As of 2.7.0, XML configuration entries expecting millisecond durations can be provided in human-readable format (English only), as per DurationParser (e.g., "5 minutes and 30 seconds" or "5m30s").
XML configuration usage:
<delay class="com.norconex.collector.http.delay.impl.ReferenceDelayResolver"
    default="(milliseconds)"
    ignoreRobotsCrawlDelay="[false|true]"
    scope="[crawler|site|thread]">
  <pattern delay="(delay in milliseconds)">
    (regular expression applied against document reference)
  </pattern>
  (... repeat pattern tag as needed ...)
</delay>
XML usage example:
<delay class="ReferenceDelayResolver" default="3 seconds">
  <pattern delay="10 seconds">
    .*\.pdf
  </pattern>
</delay>
The above example increases the delay from the default of 3 seconds to 10 seconds when encountering PDFs.
- Since:
- 2.5.0
- Author:
- Pascal Essiembre
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
ReferenceDelayResolver.DelayReferencePattern
-
Field Summary
-
Fields inherited from class com.norconex.collector.http.delay.impl.AbstractDelayResolver
DEFAULT_DELAY, SCOPE_CRAWLER, SCOPE_SITE, SCOPE_THREAD
-
-
Constructor Summary
Constructors Constructor Description ReferenceDelayResolver()
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description boolean
equals(Object other)
List<ReferenceDelayResolver.DelayReferencePattern>
getDelayReferencePatterns()
int
hashCode()
protected void
loadDelaysFromXML(XML xml)
Loads explicit configuration of delays from XML.
protected long
resolveExplicitDelay(String url)
Resolves explicitly specified delay, in milliseconds.
protected void
saveDelaysToXML(XML xml)
Saves explicit configuration of delays to XML.
void
setDelayReferencePatterns(List<ReferenceDelayResolver.DelayReferencePattern> delayPatterns)
String
toString()
-
Methods inherited from class com.norconex.collector.http.delay.impl.AbstractDelayResolver
delay, getDefaultDelay, getScope, isIgnoreRobotsCrawlDelay, loadFromXML, saveToXML, setDefaultDelay, setIgnoreRobotsCrawlDelay, setScope
-
-
-
-
Method Detail
-
getDelayReferencePatterns
public List<ReferenceDelayResolver.DelayReferencePattern> getDelayReferencePatterns()
-
setDelayReferencePatterns
public void setDelayReferencePatterns(List<ReferenceDelayResolver.DelayReferencePattern> delayPatterns)
-
resolveExplicitDelay
protected long resolveExplicitDelay(String url)
Description copied from class: AbstractDelayResolver
Resolves explicitly specified delay, in milliseconds. This method is only invoked when there are no delays from robots.txt. If the implementing class does not resolve a delay, -1 is returned (the default delay will be used).
- Specified by:
resolveExplicitDelay
in class AbstractDelayResolver
- Parameters:
url
- URL for which to resolve delay
- Returns:
- delay in milliseconds, or -1
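The -1 contract described above can be illustrated with a small stand-in hierarchy. None of the classes below are Norconex classes: `ExplicitDelayContractSketch`, `effectiveDelay` and `PdfOnlyDelaySketch` are hypothetical names, and the real base class may apply the default differently.

```java
// Hypothetical stand-in for the base-class contract; not the Norconex source.
abstract class ExplicitDelayContractSketch {
    private final long defaultDelayMs;

    ExplicitDelayContractSketch(long defaultDelayMs) {
        this.defaultDelayMs = defaultDelayMs;
    }

    /** Returns a delay in milliseconds, or -1 when no explicit delay applies. */
    protected abstract long resolveExplicitDelay(String url);

    /** Caller side: a negative result means "use the default delay". */
    final long effectiveDelay(String url) {
        long explicit = resolveExplicitDelay(url);
        return explicit >= 0 ? explicit : defaultDelayMs;
    }
}

// Hypothetical subclass: 10 seconds for PDFs, no opinion otherwise.
final class PdfOnlyDelaySketch extends ExplicitDelayContractSketch {
    PdfOnlyDelaySketch() {
        super(3000L); // default of 3 seconds
    }

    @Override
    protected long resolveExplicitDelay(String url) {
        // String.matches requires the whole reference to match the regex.
        return url.matches(".*\\.pdf") ? 10000L : -1L;
    }
}
```

Returning -1 rather than the default itself keeps the fallback decision in one place, so a subclass only has to report the delays it actually knows about.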
-
loadDelaysFromXML
protected void loadDelaysFromXML(XML xml)
Description copied from class: AbstractDelayResolver
Loads explicit configuration of delays from XML. Implementors should override this method if they wish to add extra configurable elements. Default implementation does nothing.
- Overrides:
loadDelaysFromXML
in class AbstractDelayResolver
- Parameters:
xml
- configuration
-
saveDelaysToXML
protected void saveDelaysToXML(XML xml)
Description copied from class: AbstractDelayResolver
Saves explicit configuration of delays to XML. Implementors should override this method if they wish to add extra configurable elements. Default implementation does nothing.
- Overrides:
saveDelaysToXML
in class AbstractDelayResolver
- Parameters:
xml
- XML
-
equals
public boolean equals(Object other)
- Overrides:
equals
in class AbstractDelayResolver
-
hashCode
public int hashCode()
- Overrides:
hashCode
in class AbstractDelayResolver
-
toString
public String toString()
- Overrides:
toString
in class AbstractDelayResolver
-
-