public class GenericRecrawlableResolver extends Object implements IRecrawlableResolver, IXMLConfigurable
Relies on both sitemap directives and custom instructions for establishing the minimum frequency between each document recrawl.
Provided crawler support for sitemaps has not been disabled, this class tries to honor last modified and frequency directives found in sitemap files.
By default, existing sitemap directives take precedence over custom ones.
You can choose to have sitemap directives be considered last, or even disable
sitemap directives, using the setSitemapSupport(SitemapSupport)
method.
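The precedence between the two sources can be pictured with a small, self-contained sketch. This is not the class's actual implementation: the `Support` enum, the `resolve` method, and the use of `Optional<Boolean>` to mean "this source has no opinion" are hypothetical names introduced for illustration, and the fallback of recrawling when neither source applies is an assumption.

```java
import java.util.Optional;

// Hypothetical model of the first/last/never precedence; not Norconex code.
class SitemapPrecedenceSketch {

    enum Support { FIRST, LAST, NEVER }

    /**
     * Combines two optional decisions. An empty Optional means the source
     * (sitemap directives or custom minimum frequencies) says nothing
     * about the document.
     */
    static boolean resolve(Support support,
            Optional<Boolean> sitemapDecision,
            Optional<Boolean> customDecision) {
        // A null strategy is treated as FIRST, mirroring the documented default.
        Support effective = (support == null) ? Support.FIRST : support;
        Optional<Boolean> winner;
        switch (effective) {
            case FIRST:
                // Sitemap directives win when present, custom rules otherwise.
                winner = sitemapDecision.isPresent() ? sitemapDecision : customDecision;
                break;
            case LAST:
                // Custom rules win when present, sitemap directives otherwise.
                winner = customDecision.isPresent() ? customDecision : sitemapDecision;
                break;
            default:
                // NEVER: sitemap directives are ignored entirely.
                winner = customDecision;
                break;
        }
        // Assumption: recrawl when no source has an opinion.
        return winner.orElse(true);
    }

    public static void main(String[] args) {
        Optional<Boolean> sitemapSaysNo = Optional.of(false);
        Optional<Boolean> customSaysYes = Optional.of(true);
        System.out.println(resolve(Support.FIRST, sitemapSaysNo, customSaysYes)); // false
        System.out.println(resolve(Support.LAST, sitemapSaysNo, customSaysYes));  // true
        System.out.println(resolve(Support.NEVER, sitemapSaysNo, Optional.empty())); // true
    }
}
```

In this model, switching from `FIRST` to `LAST` only changes the outcome when both sources have something to say about the same document.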
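You can choose to have some of your crawled documents be re-crawled less
frequently than others by specifying custom minimum frequencies
(setMinFrequencies(MinFrequency...)). Minimum frequencies are
processed in the order specified and must each specify what the pattern
applies to (applyTo: reference or contentType), a regular expression
pattern, and a minimum frequency value (always, hourly, daily, weekly,
monthly, yearly, never, or a number of milliseconds).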
As of 2.7.0, XML configuration entries expecting millisecond durations
can be provided in human-readable format (English only), as per
DurationParser
(e.g., "5 minutes and 30 seconds" or "5m30s").
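To make the two accepted notations concrete, the following is a minimal sketch of how such text could be turned into milliseconds. It is not DurationParser's implementation or API: the class `DurationSketch`, the method `toMillis`, and the list of recognized unit spellings are assumptions made for illustration only.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical duration reader; the real DurationParser may behave differently.
class DurationSketch {

    // A number followed by a unit word or abbreviation, e.g. "5 minutes" or "30s".
    private static final Pattern PART = Pattern.compile("(\\d+)\\s*([a-zA-Z]+)");

    static long toMillis(String text) {
        String trimmed = text.trim();
        // A bare number is taken as milliseconds already.
        if (trimmed.matches("\\d+")) {
            return Long.parseLong(trimmed);
        }
        long total = 0;
        boolean found = false;
        Matcher m = PART.matcher(trimmed);
        // Filler words such as "and" carry no number, so they never match.
        while (m.find()) {
            total += Long.parseLong(m.group(1)) * unitMillis(m.group(2));
            found = true;
        }
        if (!found) {
            throw new IllegalArgumentException("Not a duration: " + text);
        }
        return total;
    }

    // Assumed unit spellings; extend as needed.
    private static long unitMillis(String unit) {
        switch (unit.toLowerCase(Locale.ROOT)) {
            case "ms": case "millisecond": case "milliseconds": return 1L;
            case "s": case "second": case "seconds": return 1_000L;
            case "m": case "minute": case "minutes": return 60_000L;
            case "h": case "hour": case "hours": return 3_600_000L;
            case "d": case "day": case "days": return 86_400_000L;
            default: throw new IllegalArgumentException("Unknown unit: " + unit);
        }
    }

    public static void main(String[] args) {
        System.out.println(toMillis("5 minutes and 30 seconds")); // 330000
        System.out.println(toMillis("5m30s"));                    // 330000
        System.out.println(toMillis("1800000"));                  // 1800000
    }
}
```

Both notations from the example resolve to the same value, 330000 milliseconds, which is the number you would otherwise have to write by hand.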
XML configuration usage:
<recrawlableResolver
    class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
    sitemapSupport="[first|last|never]">
  <minFrequency
      applyTo="[reference|contentType]"
      caseSensitive="[false|true]"
      value="([always|hourly|daily|weekly|monthly|yearly|never] or milliseconds)">
    (regex pattern)
  </minFrequency>
  (... repeat frequency tag as needed ...)
</recrawlableResolver>
XML usage example:
<recrawlableResolver
    class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver"
    sitemapSupport="last">
  <minFrequency
      applyTo="contentType"
      value="monthly">
    application/pdf
  </minFrequency>
  <minFrequency
      applyTo="reference"
      value="1800000">
    .*latest-news.*\.html
  </minFrequency>
</recrawlableResolver>
The above example ensures PDFs are re-crawled no more frequently than once a month, while HTML news pages can be re-crawled as often as every half hour. For the rest, it relies on the website sitemap directives (if any).
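The effect of these two rules can be sketched in plain Java. This is an illustrative model only, not the resolver's code: the `Rule` class, `toDuration`, and this `isRecrawlable` signature are hypothetical, and several behaviors are assumptions, namely the lengths assigned to named frequencies (for example, monthly as 30 days), whole-string case-insensitive matching, first matching rule wins, and recrawling when no rule matches.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical model of minimum-frequency rules; not Norconex code.
class MinFrequencySketch {

    static final class Rule {
        final String applyTo;   // "reference" or "contentType"
        final Pattern pattern;  // regular expression from the tag body
        final String value;     // named frequency or milliseconds

        Rule(String applyTo, String regex, String value) {
            this.applyTo = applyTo;
            // Assumption: matching is case-insensitive unless stated otherwise.
            this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            this.value = value;
        }
    }

    // Assumed durations for the named frequencies.
    static Duration toDuration(String value) {
        switch (value.toLowerCase(Locale.ROOT)) {
            case "always":  return Duration.ZERO;
            case "hourly":  return Duration.ofHours(1);
            case "daily":   return Duration.ofDays(1);
            case "weekly":  return Duration.ofDays(7);
            case "monthly": return Duration.ofDays(30);
            case "yearly":  return Duration.ofDays(365);
            case "never":   return Duration.ofMillis(Long.MAX_VALUE);
            default:        return Duration.ofMillis(Long.parseLong(value));
        }
    }

    static boolean isRecrawlable(List<Rule> rules, String reference,
            String contentType, Duration sinceLastCrawl) {
        // Rules are checked in the order given; the first match decides.
        for (Rule rule : rules) {
            String target = "contentType".equals(rule.applyTo) ? contentType : reference;
            if (target != null && rule.pattern.matcher(target).matches()) {
                return sinceLastCrawl.compareTo(toDuration(rule.value)) >= 0;
            }
        }
        // Assumption: with no matching rule, the document may be recrawled.
        return true;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
                new Rule("contentType", "application/pdf", "monthly"),
                new Rule("reference", ".*latest-news.*\\.html", "1800000"));
        System.out.println(isRecrawlable(rules, "https://example.com/doc.pdf",
                "application/pdf", Duration.ofDays(10)));      // false
        System.out.println(isRecrawlable(rules, "https://example.com/latest-news/today.html",
                "text/html", Duration.ofMinutes(31)));         // true
    }
}
```

A PDF fetched 10 days ago is held back by the monthly rule, while a news page fetched 31 minutes ago has passed its 1800000 millisecond (30 minute) minimum.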
Modifier and Type | Class and Description |
---|---|
static class | GenericRecrawlableResolver.MinFrequency |
static class | GenericRecrawlableResolver.SitemapSupport |
Constructor and Description |
---|
GenericRecrawlableResolver() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
List<GenericRecrawlableResolver.MinFrequency> | getMinFrequencies() Gets minimum frequencies. |
GenericRecrawlableResolver.SitemapSupport | getSitemapSupport() Gets the sitemap support strategy. |
int | hashCode() |
boolean | isRecrawlable(HttpDocInfo prevData) Whether a document is recrawlable or not. |
void | loadFromXML(XML xml) |
void | saveToXML(XML xml) |
void | setMinFrequencies(Collection<GenericRecrawlableResolver.MinFrequency> minFrequencies) Sets minimum frequencies. |
void | setMinFrequencies(GenericRecrawlableResolver.MinFrequency... minFrequencies) Sets minimum frequencies. |
void | setSitemapSupport(GenericRecrawlableResolver.SitemapSupport sitemapSupport) Sets the sitemap support strategy. |
String | toString() |
public GenericRecrawlableResolver.SitemapSupport getSitemapSupport()
Gets the sitemap support strategy. Defaults to GenericRecrawlableResolver.SitemapSupport.FIRST.

public void setSitemapSupport(GenericRecrawlableResolver.SitemapSupport sitemapSupport)
Sets the sitemap support strategy. A null value is equivalent to specifying the default GenericRecrawlableResolver.SitemapSupport.FIRST.
Parameters:
sitemapSupport - sitemap support strategy

public List<GenericRecrawlableResolver.MinFrequency> getMinFrequencies()
Gets minimum frequencies.

public void setMinFrequencies(GenericRecrawlableResolver.MinFrequency... minFrequencies)
Sets minimum frequencies.
Parameters:
minFrequencies - minimum frequencies

public void setMinFrequencies(Collection<GenericRecrawlableResolver.MinFrequency> minFrequencies)
Sets minimum frequencies.
Parameters:
minFrequencies - minimum frequencies

public boolean isRecrawlable(HttpDocInfo prevData)
Description copied from interface: IRecrawlableResolver
Whether a document is recrawlable or not.
Specified by:
isRecrawlable in interface IRecrawlableResolver
Parameters:
prevData - data about previously crawled document (if any)
Returns:
true if recrawlable

public void loadFromXML(XML xml)
Specified by:
loadFromXML in interface IXMLConfigurable

public void saveToXML(XML xml)
Specified by:
saveToXML in interface IXMLConfigurable
Copyright © 2009–2023 Norconex Inc. All rights reserved.