Class AbstractDelayResolver
- java.lang.Object
-
- com.norconex.collector.http.delay.impl.AbstractDelayResolver
-
- All Implemented Interfaces:
IDelayResolver, IXMLConfigurable
- Direct Known Subclasses:
GenericDelayResolver, ReferenceDelayResolver
public abstract class AbstractDelayResolver extends Object implements IDelayResolver, IXMLConfigurable
Base implementation for creating voluntary delays between URL downloads. This base class offers a few ways the actual delay value can be defined (in order):
- Takes the delay specified by a robots.txt file. Only applicable if robots.txt files and their crawl delays are not ignored.
- Takes an explicitly specified delay, as per the implementing class.
- Uses the specified default delay, or 3 seconds if none is specified.
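The resolution order above can be sketched as a small, self-contained method. This is only an illustration of the documented order, not the actual implementation: the class name DelayOrderSketch, the method resolve, and the convention that a negative robots.txt or default value means "not specified" are assumptions made for the example.

```java
/** Hypothetical stand-in illustrating the documented order; not part of the Norconex API. */
final class DelayOrderSketch {

    /** Mirrors the documented 3-second fallback, in milliseconds. */
    static final long DEFAULT_DELAY = 3000L;

    /**
     * Picks a delay in milliseconds.
     * Assumption: a negative value means "not specified" for every input.
     */
    static long resolve(long robotsDelay, boolean ignoreRobotsCrawlDelay,
            long explicitDelay, long defaultDelay) {
        // 1. A robots.txt crawl delay wins, unless it is ignored or absent.
        if (!ignoreRobotsCrawlDelay && robotsDelay >= 0) {
            return robotsDelay;
        }
        // 2. Otherwise use the delay explicitly resolved by the implementing class.
        if (explicitDelay >= 0) {
            return explicitDelay;
        }
        // 3. Otherwise use the configured default, or 3 seconds if none is set.
        return defaultDelay >= 0 ? defaultDelay : DEFAULT_DELAY;
    }

    public static void main(String[] args) {
        System.out.println(resolve(5000, false, 1000, -1)); // robots.txt delay applies
        System.out.println(resolve(5000, true, 1000, -1));  // robots.txt delay ignored
        System.out.println(resolve(-1, false, -1, -1));     // nothing specified
    }
}
```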
One of the following scopes dictates how the delay is applied, listed in order from the best behaved to the least:
- crawler: the delay is applied between each URL download within a crawler instance, regardless of how many threads are defined within that crawler, or whether URLs are from the same site or not. This is the default scope.
- site: the delay is applied between each URL download from the same site within a crawler instance, regardless of how many threads are defined. A site is defined by a URL protocol and its domain (e.g. http://example.com).
- thread: the delay is applied between each URL download from any given thread. The more threads you have, the less of an impact the delay will have.
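One way to picture the three scopes is as the key under which the time of the last download would be tracked. The sketch below is an assumption-laden illustration, not how the class is actually implemented: ScopeKeySketch, scopeKey, and the threadName parameter are hypothetical names.

```java
import java.net.URI;

/** Hypothetical helper showing what each scope groups downloads by. */
final class ScopeKeySketch {

    /** Returns the key that downloads sharing a delay would have in common. */
    static String scopeKey(String scope, String url, String threadName) {
        switch (scope) {
            case "crawler":
                // A single key: every download in the crawler shares the delay.
                return "crawler";
            case "site": {
                // Protocol plus domain, e.g. http://example.com
                URI uri = URI.create(url);
                return uri.getScheme() + "://" + uri.getHost();
            }
            case "thread":
                // One key per thread: each thread waits independently.
                return "thread:" + threadName;
            default:
                throw new IllegalArgumentException("Unknown scope: " + scope);
        }
    }

    public static void main(String[] args) {
        String url = "http://example.com/docs/page.html?x=1";
        System.out.println(scopeKey("crawler", url, "worker-1"));
        System.out.println(scopeKey("site", url, "worker-1"));
        System.out.println(scopeKey("thread", url, "worker-1"));
    }
}
```

Under this view, two URLs on the same site map to the same key in the site scope but may map to different keys in the thread scope, which is why the thread scope is the least restrictive.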
XML configuration usage:
The following should be shared across concrete implementations (which can add more configurable attributes and tags).
<delay class="(implementing class)" default="(milliseconds)" ignoreRobotsCrawlDelay="[false|true]" scope="[crawler|site|thread]"/>
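To make the shared attributes concrete, the following sketch reads them from a delay element using only the JDK's DOM parser. It is not the loader used by this class (which receives an XML object through loadFromXML); the class name DelayConfigSketch, the implementing class name com.example.MyDelayResolver, and the assumption that ignoreRobotsCrawlDelay defaults to false are all illustrative.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

/** Hypothetical reader for the shared attributes; uses the JDK DOM parser only. */
final class DelayConfigSketch {

    /** Parses an XML string and returns its root element. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid XML", e);
        }
    }

    /** Default delay in milliseconds; 3 seconds when the attribute is absent. */
    static long defaultDelay(Element delay) {
        String value = delay.getAttribute("default"); // empty string if absent
        return value.isEmpty() ? 3000L : Long.parseLong(value);
    }

    /** Delay scope; "crawler" is the documented default. */
    static String scope(Element delay) {
        String value = delay.getAttribute("scope");
        return value.isEmpty() ? "crawler" : value;
    }

    /** Assumed to be false when the attribute is absent. */
    static boolean ignoreRobotsCrawlDelay(Element delay) {
        return Boolean.parseBoolean(delay.getAttribute("ignoreRobotsCrawlDelay"));
    }

    public static void main(String[] args) {
        Element delay = parse("<delay class=\"com.example.MyDelayResolver\""
                + " default=\"1000\" ignoreRobotsCrawlDelay=\"true\" scope=\"site\"/>");
        System.out.println(defaultDelay(delay));
        System.out.println(scope(delay));
        System.out.println(ignoreRobotsCrawlDelay(delay));
    }
}
```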
- Since:
- 2.5.0
- Author:
- Pascal Essiembre
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static long DEFAULT_DELAY: Default delay is 3 seconds.
- static String SCOPE_CRAWLER
- static String SCOPE_SITE
- static String SCOPE_THREAD
-
Constructor Summary
Constructors (Constructor, Description):
- AbstractDelayResolver()
-
Method Summary
Methods (Modifier and Type, Method, Description):
- void delay(RobotsTxt robotsTxt, String url): Delay crawling activities (if applicable).
- boolean equals(Object other)
- long getDefaultDelay(): Gets the default delay in milliseconds.
- String getScope(): Gets the delay scope.
- int hashCode()
- boolean isIgnoreRobotsCrawlDelay(): Gets whether to ignore crawl delays specified in a site robots.txt file.
- protected void loadDelaysFromXML(XML xml): Loads explicit configuration of delays from XML.
- void loadFromXML(XML xml)
- protected abstract long resolveExplicitDelay(String url): Resolves explicitly specified delay, in milliseconds.
- protected void saveDelaysToXML(XML xml): Saves explicit configuration of delays to XML.
- void saveToXML(XML xml)
- void setDefaultDelay(long defaultDelay): Sets the default delay in milliseconds.
- void setIgnoreRobotsCrawlDelay(boolean ignoreRobotsCrawlDelay): Sets whether to ignore crawl delays specified in a site robots.txt file.
- void setScope(String scope): Sets the delay scope.
- String toString()
-
-
-
Field Detail
-
SCOPE_CRAWLER
public static final String SCOPE_CRAWLER
- See Also:
- Constant Field Values
-
SCOPE_SITE
public static final String SCOPE_SITE
- See Also:
- Constant Field Values
-
SCOPE_THREAD
public static final String SCOPE_THREAD
- See Also:
- Constant Field Values
-
DEFAULT_DELAY
public static final long DEFAULT_DELAY
Default delay is 3 seconds.
- See Also:
- Constant Field Values
-
-
Method Detail
-
delay
public void delay(RobotsTxt robotsTxt, String url)
Description copied from interface: IDelayResolver
Delay crawling activities (if applicable).
- Specified by:
delay in interface IDelayResolver
- Parameters:
robotsTxt - robots.txt instance (if applicable)
url - the URL being crawled
-
getDefaultDelay
public long getDefaultDelay()
Gets the default delay in milliseconds.
- Returns:
- default delay
-
setDefaultDelay
public void setDefaultDelay(long defaultDelay)
Sets the default delay in milliseconds.
- Parameters:
defaultDelay - default delay
-
isIgnoreRobotsCrawlDelay
public boolean isIgnoreRobotsCrawlDelay()
Gets whether to ignore crawl delays specified in a site robots.txt file. Not applicable when robots.txt files are ignored.
- Returns:
true if ignoring robots.txt crawl delay
-
setIgnoreRobotsCrawlDelay
public void setIgnoreRobotsCrawlDelay(boolean ignoreRobotsCrawlDelay)
Sets whether to ignore crawl delays specified in a site robots.txt file. Not applicable when robots.txt files are ignored.
- Parameters:
ignoreRobotsCrawlDelay - true if ignoring robots.txt crawl delay
-
getScope
public String getScope()
Gets the delay scope.
- Returns:
- delay scope
-
setScope
public void setScope(String scope)
Sets the delay scope.
- Parameters:
scope - one of "crawler", "site", or "thread".
-
resolveExplicitDelay
protected abstract long resolveExplicitDelay(String url)
Resolves the explicitly specified delay, in milliseconds. This method is only invoked when no delay is obtained from robots.txt. If the implementing class cannot resolve a delay, it returns -1 and the default delay is used.
- Parameters:
url - URL for which to resolve delay
- Returns:
- delay in milliseconds, or -1
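The -1 contract can be illustrated with a self-contained sketch. To keep it runnable without the Norconex libraries, it uses a hypothetical stand-in base class (ResolverContractSketch) rather than AbstractDelayResolver itself; PrefixDelaySketch, addDelay, and effectiveDelay are likewise made-up names, and the fallback logic is an assumption based on the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical subclass: resolves a delay from the first matching URL prefix. */
final class PrefixDelaySketch extends ResolverContractSketch {

    private final Map<String, Long> delaysByPrefix = new LinkedHashMap<>();

    void addDelay(String urlPrefix, long delayMillis) {
        delaysByPrefix.put(urlPrefix, delayMillis);
    }

    @Override
    protected long resolveExplicitDelay(String url) {
        for (Map.Entry<String, Long> entry : delaysByPrefix.entrySet()) {
            if (url.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return -1; // no explicit delay: the caller falls back to the default
    }

    public static void main(String[] args) {
        PrefixDelaySketch resolver = new PrefixDelaySketch();
        resolver.addDelay("http://example.com/slow/", 10_000L);
        System.out.println(resolver.effectiveDelay("http://example.com/slow/page"));
        System.out.println(resolver.effectiveDelay("http://example.com/other"));
    }
}

/** Stand-in for the documented contract only; not the Norconex base class. */
abstract class ResolverContractSketch {

    private long defaultDelay = 3000L; // documented 3-second default

    void setDefaultDelay(long defaultDelay) {
        this.defaultDelay = defaultDelay;
    }

    /** Returns a delay in milliseconds, or -1 when none can be resolved. */
    protected abstract long resolveExplicitDelay(String url);

    /** Assumed fallback: a negative explicit delay means "use the default". */
    long effectiveDelay(String url) {
        long explicit = resolveExplicitDelay(url);
        return explicit >= 0 ? explicit : defaultDelay;
    }
}
```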
-
loadFromXML
public final void loadFromXML(XML xml)
- Specified by:
loadFromXML in interface IXMLConfigurable
-
loadDelaysFromXML
protected void loadDelaysFromXML(XML xml)
Loads explicit configuration of delays from XML. Implementors should override this method if they wish to add extra configurable elements. Default implementation does nothing.
- Parameters:
xml - configuration
-
saveToXML
public final void saveToXML(XML xml)
- Specified by:
saveToXML in interface IXMLConfigurable
-
saveDelaysToXML
protected void saveDelaysToXML(XML xml)
Saves explicit configuration of delays to XML. Implementors should override this method if they wish to add extra configurable elements. Default implementation does nothing.
- Parameters:
xml - XML
-
-