Class GenericRecrawlableResolver
- All Implemented Interfaces:
IRecrawlableResolver, IXMLConfigurable
Relies on both sitemap directives and custom instructions for establishing the minimum frequency between each document recrawl.
Sitemap support:
As long as crawler support for sitemaps has not been disabled, this class tries to honor the last-modified and change-frequency directives found in sitemap files.
By default, existing sitemap directives take precedence over custom ones.
You can choose to have sitemap directives considered last, or to disable
sitemap directives altogether, using the setSitemapSupport(SitemapSupport)
method.
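The precedence described above can be pictured with a small standalone sketch. It does not use the crawler's classes: the local SitemapSupport enum only mirrors the first/last/never values, and the resolve method, its Optional parameters, and the "recrawl when nothing decides" fallback are all assumptions made for illustration, not the actual implementation.

```java
import java.util.Optional;

public class SitemapPrecedenceSketch {

    // Local stand-in mirroring the documented first/last/never strategies.
    public enum SitemapSupport { FIRST, LAST, NEVER }

    /**
     * Picks which source decides whether a document is recrawlable.
     * An empty Optional means that source has no opinion on the document.
     * Hypothetical helper: names and fallback behaviour are assumptions.
     */
    public static boolean resolve(
            SitemapSupport support,
            Optional<Boolean> sitemapDecision,
            Optional<Boolean> customDecision) {
        // A null strategy is treated like the documented default (FIRST).
        SitemapSupport effective = support == null ? SitemapSupport.FIRST : support;
        switch (effective) {
            case FIRST:
                // Sitemap directives win when present, custom rules otherwise.
                return sitemapDecision.or(() -> customDecision).orElse(true);
            case LAST:
                // Custom rules win when present, sitemap directives otherwise.
                return customDecision.or(() -> sitemapDecision).orElse(true);
            default:
                // NEVER: sitemap directives are ignored entirely.
                return customDecision.orElse(true);
        }
    }

    public static void main(String[] args) {
        Optional<Boolean> sitemapSaysNo = Optional.of(false);
        Optional<Boolean> customSaysYes = Optional.of(true);
        System.out.println(resolve(SitemapSupport.FIRST, sitemapSaysNo, customSaysYes)); // false
        System.out.println(resolve(SitemapSupport.LAST, sitemapSaysNo, customSaysYes));  // true
    }
}
```

With the real class, the equivalent choice is made through setSitemapSupport(SitemapSupport) or the sitemapSupport XML attribute.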
Custom recrawl frequencies:
You can choose to have some of your crawled documents be re-crawled less
frequently than others by specifying custom minimum frequencies
(setMinFrequencies(MinFrequency...)). Minimum frequencies are
processed in the order specified and must each have the following:
- applyTo: Either "reference" or "contentType" (defaults to "reference").
- pattern: A regular expression.
- value: one of "always", "hourly", "daily", "weekly", "monthly", "yearly", "never", or a numeric value in milliseconds.
As of 2.7.0, XML configuration entries expecting millisecond durations
can be provided in human-readable format (English only), as per
DurationParser (e.g., "5 minutes and 30 seconds" or "5m30s").
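To make the accepted value forms concrete, here is a minimal standalone sketch that turns a value into milliseconds. It is not the resolver's code: the class and method names are hypothetical, and treating "always" as 0, "never" as Long.MAX_VALUE, a month as 30 days and a year as 365 days are assumptions. Human-readable durations such as "5m30s" are deliberately not handled here, since that is what DurationParser is for.

```java
import java.util.Locale;
import java.util.Map;

public class MinFrequencyValueSketch {

    private static final long HOUR = 60L * 60L * 1000L;
    private static final long DAY = 24L * HOUR;

    // Assumed approximations; the real resolver may compute these differently.
    private static final Map<String, Long> NAMED = Map.of(
            "hourly", HOUR,
            "daily", DAY,
            "weekly", 7L * DAY,
            "monthly", 30L * DAY,
            "yearly", 365L * DAY);

    /** Converts a keyword or a plain number of milliseconds to a long. */
    public static long toMillis(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        if ("always".equals(v)) {
            return 0L;                // assumption: no minimum delay at all
        }
        if ("never".equals(v)) {
            return Long.MAX_VALUE;    // assumption: a delay that never elapses
        }
        Long named = NAMED.get(v);
        if (named != null) {
            return named;
        }
        // Anything else is expected to be a plain millisecond count;
        // throws NumberFormatException otherwise.
        return Long.parseLong(v);
    }

    public static void main(String[] args) {
        System.out.println(toMillis("hourly"));   // 3600000
        System.out.println(toMillis("1800000"));  // 1800000
    }
}
```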
XML configuration usage:
<recrawlableResolver
    class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
    sitemapSupport="[first|last|never]">
  <minFrequency
      applyTo="[reference|contentType]"
      caseSensitive="[false|true]"
      value="([always|hourly|daily|weekly|monthly|yearly|never] or milliseconds)">
    (regex pattern)
  </minFrequency>
  (... repeat frequency tag as needed ...)
</recrawlableResolver>
XML usage example:
<recrawlableResolver
    class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
    sitemapSupport="last">
  <minFrequency
      applyTo="contentType"
      value="monthly">
    application/pdf
  </minFrequency>
  <minFrequency
      applyTo="reference"
      value="1800000">
    .*latest-news.*\.html
  </minFrequency>
</recrawlableResolver>
The above example ensures PDFs are re-crawled no more frequently than once a month, while HTML news pages can be re-crawled as often as every half hour. For everything else, it relies on the website's sitemap directives (if any).
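The effect of the two rules in this example can be traced with a standalone sketch. It is only an illustration under stated assumptions: the Rule class and isRecrawlable helper are hypothetical, the first matching rule is assumed to win, patterns are assumed to match the whole value case-insensitively, a month is taken as 30 days, and documents matching no rule are simply treated as recrawlable instead of consulting sitemap directives.

```java
import java.util.List;
import java.util.regex.Pattern;

public class RecrawlRuleSketch {

    /** Hypothetical stand-in for one minFrequency entry. */
    static final class Rule {
        final String applyTo;
        final Pattern pattern;
        final long minMillis;

        Rule(String applyTo, String regex, long minMillis) {
            this.applyTo = applyTo;
            // Assumption: case-insensitive matching, as caseSensitive is not set.
            this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            this.minMillis = minMillis;
        }

        boolean matches(String reference, String contentType) {
            String target = "contentType".equals(applyTo) ? contentType : reference;
            // Assumption: the pattern must match the whole value.
            return target != null && pattern.matcher(target).matches();
        }
    }

    static final long HALF_HOUR = 30L * 60L * 1000L;
    static final long MONTH = 30L * 24L * 60L * 60L * 1000L; // assumed 30 days

    // Same two rules, in the same order, as the XML example.
    static final List<Rule> RULES = List.of(
            new Rule("contentType", "application/pdf", MONTH),
            new Rule("reference", ".*latest-news.*\\.html", HALF_HOUR));

    /** True when enough time has elapsed since the previous crawl. */
    public static boolean isRecrawlable(
            String reference, String contentType, long elapsedMillis) {
        for (Rule rule : RULES) {
            if (rule.matches(reference, contentType)) {
                return elapsedMillis >= rule.minMillis; // first match decides
            }
        }
        return true; // assumption: no rule matched, nothing holds it back
    }

    public static void main(String[] args) {
        long oneDay = 24L * 60L * 60L * 1000L;
        // A PDF crawled a day ago is not yet due.
        System.out.println(isRecrawlable(
                "http://example.com/report.pdf", "application/pdf", oneDay)); // false
        // A news page crawled 45 minutes ago is due again.
        System.out.println(isRecrawlable(
                "http://example.com/latest-news/today.html", "text/html",
                45L * 60L * 1000L)); // true
    }
}
```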
- Since:
- 2.5.0
- Author:
- Pascal Essiembre
-
Nested Class Summary
Nested Classes (Modifier and Type, Class):
- static class GenericRecrawlableResolver.MinFrequency
- static enum GenericRecrawlableResolver.SitemapSupport
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description:
- boolean equals(Object)
- getMinFrequencies(): Gets minimum frequencies.
- GenericRecrawlableResolver.SitemapSupport getSitemapSupport(): Gets the sitemap support strategy.
- int hashCode()
- boolean isRecrawlable(HttpDocInfo prevData): Whether a document is recrawlable or not.
- void loadFromXML(XML xml)
- void saveToXML(XML xml)
- void setMinFrequencies(GenericRecrawlableResolver.MinFrequency... minFrequencies): Sets minimum frequencies.
- void setMinFrequencies(Collection<GenericRecrawlableResolver.MinFrequency> minFrequencies): Sets minimum frequencies.
- void setSitemapSupport(GenericRecrawlableResolver.SitemapSupport sitemapSupport): Sets the sitemap support strategy.
- String toString()
-
Constructor Details
-
GenericRecrawlableResolver
public GenericRecrawlableResolver()
-
-
Method Details
-
getSitemapSupport
Gets the sitemap support strategy. Default is GenericRecrawlableResolver.SitemapSupport.FIRST.
- Returns:
- sitemap support strategy
-
setSitemapSupport
Sets the sitemap support strategy. A null value is equivalent to specifying the default GenericRecrawlableResolver.SitemapSupport.FIRST.
- Parameters:
sitemapSupport - sitemap support strategy
-
getMinFrequencies
Gets minimum frequencies.
- Returns:
- minimum frequencies
-
setMinFrequencies
Sets minimum frequencies.
- Parameters:
minFrequencies - minimum frequencies
-
setMinFrequencies
Sets minimum frequencies.
- Parameters:
minFrequencies - minimum frequencies
- Since:
- 3.0.0
-
isRecrawlable
Description copied from interface: IRecrawlableResolver
Whether a document is recrawlable or not.
- Specified by:
isRecrawlable in interface IRecrawlableResolver
- Parameters:
prevData - data about previously crawled document (if any)
- Returns:
true if recrawlable
-
loadFromXML
- Specified by:
loadFromXML in interface IXMLConfigurable
-
saveToXML
- Specified by:
saveToXML in interface IXMLConfigurable
-
equals
-
hashCode
public int hashCode() -
toString
-