Class GenericRecrawlableResolver
- java.lang.Object
-
- com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver
-
- All Implemented Interfaces:
IRecrawlableResolver, IXMLConfigurable
public class GenericRecrawlableResolver extends Object implements IRecrawlableResolver, IXMLConfigurable
Relies on both sitemap directives and custom instructions for establishing the minimum frequency between each document recrawl.
Sitemap support:
Provided crawler support for sitemaps has not been disabled, this class tries to honor last modified and frequency directives found in sitemap files.
By default, existing sitemap directives take precedence over custom ones. You can choose to have sitemap directives considered last, or disable them entirely, using the setSitemapSupport(SitemapSupport) method.
Custom recrawl frequencies:
You can choose to have some of your crawled documents re-crawled less frequently than others by specifying custom minimum frequencies (setMinFrequencies(MinFrequency...)). Minimum frequencies are processed in the order specified and must each have the following:
- applyTo: Either "reference" or "contentType" (defaults to "reference").
- pattern: A regular expression.
- value: one of "always", "hourly", "daily", "weekly", "monthly", "yearly", "never", or a numeric value in milliseconds.
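As an illustration of the accepted "value" forms, here is a minimal sketch of how a named frequency or a numeric string could be turned into a minimum delay in milliseconds. The class MinFrequencyValues and its toMillis method are hypothetical and not part of the Norconex API. The fixed lengths used for "monthly" (30 days) and "yearly" (365 days), and the handling of "always" and "never", are assumptions made for the example; the actual class may compute these differently.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not part of the Norconex API. */
final class MinFrequencyValues {

    private MinFrequencyValues() {
    }

    /**
     * Converts a minFrequency "value" into a minimum delay in milliseconds.
     * Assumptions: a month is 30 days, a year is 365 days, "always" means
     * no delay, and "never" is modeled as Long.MAX_VALUE.
     */
    static long toMillis(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        switch (v) {
            case "always":
                return 0L;
            case "hourly":
                return TimeUnit.HOURS.toMillis(1);
            case "daily":
                return TimeUnit.DAYS.toMillis(1);
            case "weekly":
                return TimeUnit.DAYS.toMillis(7);
            case "monthly":
                return TimeUnit.DAYS.toMillis(30);
            case "yearly":
                return TimeUnit.DAYS.toMillis(365);
            case "never":
                return Long.MAX_VALUE;
            default:
                // Anything else is expected to be plain milliseconds.
                return Long.parseLong(v);
        }
    }

    public static void main(String[] args) {
        System.out.println(toMillis("hourly"));   // prints 3600000
        System.out.println(toMillis("1800000"));  // prints 1800000
    }
}
```

A value that is neither a known keyword nor a number makes Long.parseLong throw NumberFormatException in this sketch; human-readable durations such as "5m30s" are not handled here.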
As of 2.7.0, XML configuration entries expecting millisecond durations can be provided in human-readable format (English only), as per DurationParser (e.g., "5 minutes and 30 seconds" or "5m30s").
XML configuration usage:
<recrawlableResolver
    class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
    sitemapSupport="[first|last|never]">
  <minFrequency
      applyTo="[reference|contentType]"
      caseSensitive="[false|true]"
      value="([always|hourly|daily|weekly|monthly|yearly|never] or milliseconds)">
    (regex pattern)
  </minFrequency>
  (... repeat frequency tag as needed ...)
</recrawlableResolver>
XML usage example:
<recrawlableResolver
    class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
    sitemapSupport="last">
  <minFrequency applyTo="contentType" value="monthly">
    application/pdf
  </minFrequency>
  <minFrequency applyTo="reference" value="1800000">
    .*latest-news.*\.html
  </minFrequency>
</recrawlableResolver>
The above example ensures PDFs are re-crawled no more frequently than once a month, while HTML news pages can be re-crawled as often as every half hour. For the rest, it relies on the website sitemap directives (if any).
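To make the effect of the two rules in the XML example concrete, the following sketch models ordered rule matching in plain Java. It is not the Norconex implementation: RecrawlSketch, Rule, and exampleRules are hypothetical names, a month is assumed to be 30 days, matching is assumed to be case-insensitive and against the whole string, and a document matching no rule is simply treated as recrawlable (the real class would consult sitemap directives at that point).

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

/** Hypothetical model of ordered minimum-frequency rules. */
final class RecrawlSketch {

    /** Hypothetical stand-in for one minFrequency entry. */
    static final class Rule {
        final String applyTo;       // "reference" or "contentType"
        final Pattern pattern;      // regular expression to match
        final long minDelayMillis;  // minimum delay between crawls

        Rule(String applyTo, String regex, long minDelayMillis) {
            this.applyTo = applyTo;
            // Assumption: matching is case-insensitive by default.
            this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            this.minDelayMillis = minDelayMillis;
        }
    }

    /** The two rules from the XML example (month assumed to be 30 days). */
    static List<Rule> exampleRules() {
        return Arrays.asList(
                new Rule("contentType", "application/pdf",
                        TimeUnit.DAYS.toMillis(30)),
                new Rule("reference", ".*latest-news.*\\.html", 1_800_000L));
    }

    /**
     * Applies the first matching rule, in the order given. Returns true
     * when at least the rule's minimum delay has elapsed since the last
     * crawl, or when no rule matches (an assumption of this sketch).
     */
    static boolean isRecrawlable(List<Rule> rules, String reference,
            String contentType, long lastCrawledMillis, long nowMillis) {
        for (Rule rule : rules) {
            String target = "contentType".equals(rule.applyTo)
                    ? contentType : reference;
            if (target != null && rule.pattern.matcher(target).matches()) {
                return nowMillis - lastCrawledMillis >= rule.minDelayMillis;
            }
        }
        return true;
    }
}
```

With these rules, a PDF last crawled ten days ago is skipped, while a page such as http://example.com/latest-news/today.html (a made-up URL) becomes eligible again once 30 minutes have passed.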
- Since:
- 2.5.0
- Author:
- Pascal Essiembre
-
-
Nested Class Summary
Nested Classes
Modifier and Type, Class
static class GenericRecrawlableResolver.MinFrequency
static class GenericRecrawlableResolver.SitemapSupport
-
Constructor Summary
Constructors
GenericRecrawlableResolver()
-
Method Summary
Modifier and Type, Method: Description
boolean equals(Object other)
List<GenericRecrawlableResolver.MinFrequency> getMinFrequencies(): Gets minimum frequencies.
GenericRecrawlableResolver.SitemapSupport getSitemapSupport(): Gets the sitemap support strategy.
int hashCode()
boolean isRecrawlable(HttpDocInfo prevData): Whether a document is recrawlable or not.
void loadFromXML(XML xml)
void saveToXML(XML xml)
void setMinFrequencies(GenericRecrawlableResolver.MinFrequency... minFrequencies): Sets minimum frequencies.
void setMinFrequencies(Collection<GenericRecrawlableResolver.MinFrequency> minFrequencies): Sets minimum frequencies.
void setSitemapSupport(GenericRecrawlableResolver.SitemapSupport sitemapSupport): Sets the sitemap support strategy.
String toString()
-
-
-
Method Detail
-
getSitemapSupport
public GenericRecrawlableResolver.SitemapSupport getSitemapSupport()
Gets the sitemap support strategy. Default is GenericRecrawlableResolver.SitemapSupport.FIRST.
- Returns:
- sitemap support strategy
-
setSitemapSupport
public void setSitemapSupport(GenericRecrawlableResolver.SitemapSupport sitemapSupport)
Sets the sitemap support strategy. A null value is equivalent to specifying the default GenericRecrawlableResolver.SitemapSupport.FIRST.
- Parameters:
sitemapSupport
- sitemap support strategy
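The interplay between the sitemap support strategy and custom frequencies can be sketched as follows. This is a simplified model, not the Norconex source: SitemapPrecedenceSketch, Support, orDefault, and decide are hypothetical names, each "verdict" stands for a recrawl decision that a sitemap directive or a custom rule may or may not produce, and treating a document with no verdict at all as recrawlable is an assumption.

```java
import java.util.Optional;

/** Hypothetical model of the precedence rules; not the Norconex source. */
final class SitemapPrecedenceSketch {

    /** Mirrors the three XML values: first, last and never. */
    enum Support { FIRST, LAST, NEVER }

    /** A null strategy falls back to FIRST, as the setter documents. */
    static Support orDefault(Support support) {
        return support == null ? Support.FIRST : support;
    }

    /**
     * Picks which verdict wins. An empty Optional means that source has
     * nothing to say about the document. Assumption: with no verdict at
     * all, the document is considered recrawlable.
     */
    static boolean decide(Support support,
            Optional<Boolean> sitemapVerdict,
            Optional<Boolean> customVerdict) {
        switch (orDefault(support)) {
            case FIRST:
                return firstPresent(sitemapVerdict, customVerdict);
            case LAST:
                return firstPresent(customVerdict, sitemapVerdict);
            default:
                // NEVER: sitemap directives are ignored.
                return customVerdict.orElse(true);
        }
    }

    private static boolean firstPresent(
            Optional<Boolean> preferred, Optional<Boolean> fallback) {
        return preferred.isPresent() ? preferred.get() : fallback.orElse(true);
    }
}
```

In this model, passing null behaves exactly like FIRST: when both sources have a verdict, the sitemap one is returned.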
-
getMinFrequencies
public List<GenericRecrawlableResolver.MinFrequency> getMinFrequencies()
Gets minimum frequencies.
- Returns:
- minimum frequencies
-
setMinFrequencies
public void setMinFrequencies(GenericRecrawlableResolver.MinFrequency... minFrequencies)
Sets minimum frequencies.
- Parameters:
minFrequencies
- minimum frequencies
-
setMinFrequencies
public void setMinFrequencies(Collection<GenericRecrawlableResolver.MinFrequency> minFrequencies)
Sets minimum frequencies.
- Parameters:
minFrequencies
- minimum frequencies
- Since:
- 3.0.0
-
isRecrawlable
public boolean isRecrawlable(HttpDocInfo prevData)
Description copied from interface: IRecrawlableResolver
Whether a document is recrawlable or not.
- Specified by:
isRecrawlable
in interface IRecrawlableResolver
- Parameters:
prevData
- data about previously crawled document (if any)
- Returns:
true
if recrawlable
-
loadFromXML
public void loadFromXML(XML xml)
- Specified by:
loadFromXML
in interface IXMLConfigurable
-
saveToXML
public void saveToXML(XML xml)
- Specified by:
saveToXML
in interface IXMLConfigurable
-
-