Class GenericSitemapResolver
- All Implemented Interfaces:
ISitemapResolver, IEventListener<CrawlerEvent>, IXMLConfigurable, EventListener, Consumer<CrawlerEvent>
Implementation of ISitemapResolver as per the sitemap.xml standard
defined at
http://www.sitemaps.org/protocol.html.
Sitemaps are only resolved if they have not been resolved already for the same URL "root" (the protocol, host and possible port).
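The notion of a URL "root" and the resolve-once rule can be illustrated with a minimal sketch. The class and method names below (UrlRootSketch, urlRoot, markForResolution) are hypothetical and are not part of the Norconex API; how the resolver actually tracks resolved roots is not described here, so a plain in-memory set is assumed.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: a hypothetical helper, not part of the Norconex API. */
final class UrlRootSketch {

    /** Returns the protocol, host and (if present) port of a URL. */
    static String urlRoot(String url) {
        URI uri = URI.create(url);
        String root = uri.getScheme() + "://" + uri.getHost();
        // getPort() returns -1 when the URL carries no explicit port.
        return uri.getPort() == -1 ? root : root + ":" + uri.getPort();
    }

    /** Returns true only the first time a given URL root is seen. */
    static boolean markForResolution(Set<String> resolvedRoots, String url) {
        // Set.add returns false when the root was already recorded.
        return resolvedRoots.add(urlRoot(url));
    }

    public static void main(String[] args) {
        Set<String> resolvedRoots = new HashSet<>();
        System.out.println(markForResolution(resolvedRoots, "http://www.example.com/a.html"));
        System.out.println(markForResolution(resolvedRoots, "http://www.example.com/b/c.html"));
        System.out.println(markForResolution(resolvedRoots, "http://www.example.com:8080/a.html"));
    }
}
```

In this sketch the second URL shares its root with the first and is therefore skipped, while the third has a different port and counts as a separate root.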
The Sitemap specification dictates that a sitemap.xml file defined
in a sub-directory applies only to URLs found in that sub-directory and
its children. This restriction is enforced by default. Setting lenient
to true lifts it, so URLs outside the sitemap's sub-directory are accepted too.
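The sub-directory rule can be sketched as a simple prefix check. This is an illustration of the rule as stated by the Sitemap specification, not the resolver's actual code; SitemapScopeSketch and inScope are hypothetical names, and the lenient parameter merely mirrors the setting described above.

```java
/** Illustrative only: a hypothetical helper, not part of the Norconex API. */
final class SitemapScopeSketch {

    /**
     * Tells whether a URL listed in a sitemap falls within the scope of
     * that sitemap, i.e. the sitemap's own directory and its children.
     */
    static boolean inScope(String sitemapUrl, String listedUrl, boolean lenient) {
        if (lenient) {
            // Lenient mode: the location restriction is not applied.
            return true;
        }
        // Keep everything up to and including the last slash.
        String sitemapDir = sitemapUrl.substring(0, sitemapUrl.lastIndexOf('/') + 1);
        return listedUrl.startsWith(sitemapDir);
    }

    public static void main(String[] args) {
        String sitemap = "http://www.example.com/catalog/sitemap.xml";
        System.out.println(inScope(sitemap, "http://www.example.com/catalog/item/1.html", false));
        System.out.println(inScope(sitemap, "http://www.example.com/news/today.html", false));
        System.out.println(inScope(sitemap, "http://www.example.com/news/today.html", true));
    }
}
```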
Paths relative to URL roots can be specified and an attempt will be made
to load and parse any sitemap found at those locations for each URL root
encountered (except for "start URLs" sitemaps, see below). Default paths
are /sitemap.xml and /sitemap_index.xml.
Passing null or an empty path array to
setSitemapPaths(String...) prevents attempts to locate
sitemaps; only sitemaps found in robots.txt or defined as start
URLs will then be considered.
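How relative sitemap paths translate into candidate locations can be sketched as follows. SitemapPathsSketch and candidateLocations are hypothetical names used for illustration only; the sketch assumes each path is simply appended to the URL root, and that null or an empty array yields no candidates.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical helper, not part of the Norconex API. */
final class SitemapPathsSketch {

    /** Builds the sitemap URLs to try for a URL root from relative paths. */
    static List<String> candidateLocations(String urlRoot, String... sitemapPaths) {
        List<String> locations = new ArrayList<>();
        if (sitemapPaths == null) {
            // Null paths: no attempt is made to locate sitemaps.
            return locations;
        }
        for (String path : sitemapPaths) {
            locations.add(urlRoot + path);
        }
        return locations;
    }

    public static void main(String[] args) {
        // The two default paths named in the class description.
        System.out.println(candidateLocations(
                "http://www.example.com", "/sitemap.xml", "/sitemap_index.xml"));
        // An empty path array produces no candidate location.
        System.out.println(candidateLocations("http://www.example.com"));
    }
}
```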
Sitemaps can be specified as "start URLs" (defined in
HttpCrawlerConfig.getStartSitemapURLs()). Sitemaps defined
that way will be the only ones resolved for the root URL they represent
(sitemap paths or sitemaps defined in robots.txt won't apply).
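The precedence described above can be summarized in a short sketch. The names (SitemapPrecedenceSketch, locationsToResolve) are hypothetical, and the order in which robots.txt sitemaps and path-based sitemaps are combined is an assumption made for illustration; only the "start URLs win" rule comes from the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical helper, not part of the Norconex API. */
final class SitemapPrecedenceSketch {

    /** Picks the sitemap locations to resolve for a single URL root. */
    static List<String> locationsToResolve(
            List<String> startSitemaps,
            List<String> robotsSitemaps,
            List<String> pathSitemaps) {
        if (!startSitemaps.isEmpty()) {
            // Start URL sitemaps are the only ones resolved for their root.
            return new ArrayList<>(startSitemaps);
        }
        // Otherwise both other sources apply (ordering assumed here).
        List<String> locations = new ArrayList<>(robotsSitemaps);
        locations.addAll(pathSitemaps);
        return locations;
    }

    public static void main(String[] args) {
        List<String> start = List.of("http://www.example.com/custom-sitemap.xml");
        List<String> robots = List.of("http://www.example.com/robots-sitemap.xml");
        List<String> paths = List.of("http://www.example.com/sitemap.xml");
        System.out.println(locationsToResolve(start, robots, paths));
        System.out.println(locationsToResolve(List.of(), robots, paths));
    }
}
```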
Sitemaps are first stored in a local temporary file before
being parsed. A directory relative to the crawler work directory
will be created by default. To specify a custom directory, you can use
setTempDir(Path).
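The store-then-parse approach can be sketched with the JDK alone. SitemapTempFileSketch and storeThenParse are hypothetical names, not part of the Norconex API; the sketch uses a DOM parser for brevity (the parser the resolver actually uses is not stated here) and reads only the loc elements defined by the sitemap protocol.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Illustrative only: a hypothetical helper, not part of the Norconex API. */
final class SitemapTempFileSketch {

    /** Copies a sitemap stream to a temporary file, then parses that file. */
    static List<String> storeThenParse(InputStream sitemap, Path tempDir) {
        try {
            Files.createDirectories(tempDir);
            Path tempFile = Files.createTempFile(tempDir, "sitemap-", ".xml");
            try {
                // createTempFile already created the file, so overwrite it.
                Files.copy(sitemap, tempFile, StandardCopyOption.REPLACE_EXISTING);
                Document doc = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder().parse(tempFile.toFile());
                NodeList locs = doc.getElementsByTagName("loc");
                List<String> urls = new ArrayList<>();
                for (int i = 0; i < locs.getLength(); i++) {
                    urls.add(locs.item(i).getTextContent().trim());
                }
                return urls;
            } finally {
                // The temporary copy is no longer needed once parsed.
                Files.deleteIfExists(tempFile);
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not process sitemap.", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://www.example.com/a.html</loc></url></urlset>";
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "sitemap-sketch");
        System.out.println(storeThenParse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), dir));
    }
}
```

Writing to disk first keeps the network stream short-lived and lets the parser work from a local file, which is one plausible motivation for the design described above.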
- Since:
- 3.0.0 (merged from StandardSitemapResolver*)
- Author:
- Pascal Essiembre
-
Field Summary
Fields: DEFAULT_SITEMAP_PATHS
Constructor Summary
Constructors: GenericSitemapResolver()
Method Summary
Modifier and Type, Method, Description
- boolean equals
- getSitemapPaths: Gets the URL paths, relative to the URL root, from which to try locate and resolve sitemaps.
- getTempDir: Gets the directory where temporary sitemap files are written.
- int hashCode()
- boolean isLenient()
- void loadFromXML(XML xml)
- protected void onCrawlerCleanBegin(CrawlerEvent event)
- protected void onCrawlerEvent(CrawlerEvent event)
- protected void onCrawlerRunBegin(CrawlerEvent event)
- protected void onCrawlerStopBegin(CrawlerEvent event)
- void resolveSitemaps(HttpFetchClient fetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs): Resolves the sitemap instructions for a URL "root" (e.g. http://www.example.com).
- void saveToXML
- void setLenient(boolean lenient)
- void setSitemapPaths(String... sitemapPaths): Sets the URL paths, relative to the URL root, from which to try locate and resolve sitemaps.
- void setSitemapPaths(List<String> sitemapPaths): Sets the URL paths, relative to the URL root, from which to try locate and resolve sitemaps.
- void setTempDir(Path tempDir): Sets the directory where temporary sitemap files are written.
- String toString()
Methods inherited from class com.norconex.collector.core.crawler.CrawlerLifeCycleListener
accept, onCrawlerCleanEnd, onCrawlerInitBegin, onCrawlerInitEnd, onCrawlerRunEnd, onCrawlerRunThreadBegin, onCrawlerRunThreadEnd, onCrawlerShutdown, onCrawlerStopEnd
-
Field Details
-
DEFAULT_SITEMAP_PATHS
-
-
Constructor Details
-
GenericSitemapResolver
public GenericSitemapResolver()
-
-
Method Details
-
onCrawlerRunBegin
- Overrides:
onCrawlerRunBegin in class CrawlerLifeCycleListener
-
onCrawlerStopBegin
- Overrides:
onCrawlerStopBegin in class CrawlerLifeCycleListener
-
onCrawlerEvent
- Overrides:
onCrawlerEvent in class CrawlerLifeCycleListener
-
onCrawlerCleanBegin
- Overrides:
onCrawlerCleanBegin in class CrawlerLifeCycleListener
-
getSitemapPaths
Gets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. Default paths are "/sitemap.xml" and "/sitemap_index.xml".
- Returns:
- sitemap paths.
-
setSitemapPaths
Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
- Parameters:
sitemapPaths - sitemap paths.
-
setSitemapPaths
Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
- Parameters:
sitemapPaths - sitemap paths.
-
resolveSitemaps
public void resolveSitemaps(HttpFetchClient fetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs)
Description copied from interface: ISitemapResolver
Resolves the sitemap instructions for a URL "root" (e.g. http://www.example.com).
- Specified by:
resolveSitemaps in interface ISitemapResolver
- Parameters:
fetcher - the http fetcher executor to use to stream Internet files if needed
urlRoot - the URL root for which to resolve the sitemap
sitemapLocations - sitemap locations to resolve
sitemapURLConsumer - where to store retrieved site map URLs
startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())
-
isLenient
public boolean isLenient()
-
setLenient
public void setLenient(boolean lenient)
-
getTempDir
Gets the directory where temporary sitemap files are written.
- Returns:
- directory
-
setTempDir
Sets the directory where temporary sitemap files are written.
- Parameters:
tempDir - directory
-
loadFromXML
- Specified by:
loadFromXML in interface IXMLConfigurable
-
saveToXML
- Specified by:
saveToXML in interface IXMLConfigurable
-
equals
-
hashCode
public int hashCode()
-
toString
-