Class GenericSitemapResolver
- java.lang.Object
-
- com.norconex.collector.core.crawler.CrawlerLifeCycleListener
-
- com.norconex.collector.http.sitemap.impl.GenericSitemapResolver
-
- All Implemented Interfaces:
ISitemapResolver, IEventListener<CrawlerEvent>, IXMLConfigurable, EventListener, Consumer<CrawlerEvent>
public class GenericSitemapResolver extends CrawlerLifeCycleListener implements ISitemapResolver, IXMLConfigurable
Implementation of ISitemapResolver as per the sitemap.xml standard defined at http://www.sitemaps.org/protocol.html.
Sitemaps are only resolved if they have not been resolved already for the same URL "root" (the protocol, host and possible port).
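To make the notion of a URL "root" concrete, here is a minimal sketch that derives the protocol, host and optional port from a URL using only java.net.URI. The class and method names (UrlRootSketch, urlRoot) are hypothetical and this is not the library's own implementation, which may normalize URLs differently.

```java
import java.net.URI;

// Hypothetical helper, for illustration only.
final class UrlRootSketch {

    // Returns scheme + "://" + host, plus ":port" when a port is explicit.
    static String urlRoot(String url) {
        URI uri = URI.create(url);
        StringBuilder root = new StringBuilder();
        root.append(uri.getScheme()).append("://").append(uri.getHost());
        if (uri.getPort() != -1) {
            // getPort() returns -1 when the URL has no explicit port.
            root.append(':').append(uri.getPort());
        }
        return root.toString();
    }

    public static void main(String[] args) {
        System.out.println(urlRoot("https://example.com:8443/docs/page.html?x=1"));
        System.out.println(urlRoot("http://www.example.com/a/b"));
    }
}
```

Under this sketch, two pages on the same host and port share one root, so their sitemaps would be resolved only once.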
The Sitemap specification dictates that a sitemap.xml file defined in a sub-directory applies only to URLs found in that sub-directory and its children. This behavior is respected by default. Setting lenient to true no longer honors this restriction.
Paths relative to URL roots can be specified, and an attempt will be made to load and parse any sitemap found at those locations for each URL root encountered (except for "start URLs" sitemaps, see below). Default paths are /sitemap.xml and /sitemap_index.xml. Setting null or an empty path array on setSitemapPaths(String...) will prevent attempts to locate sitemaps, and only sitemaps found in robots.txt or defined as start URLs will be considered.
Sitemaps can be specified as "start URLs" (defined in HttpCrawlerConfig.getStartSitemapURLs()). Sitemaps defined that way will be the only ones resolved for the root URL they represent (sitemap paths or sitemaps defined in robots.txt won't apply).
Sitemaps are first stored in a local temporary file before being parsed. A directory relative to the crawler work directory will be created by default. To specify a custom directory, you can use setTempDir(Path).
- Since:
- 3.0.0 (merged from StandardSitemapResolver*)
- Author:
- Pascal Essiembre
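The precedence rules in the class description can be sketched as a small selection function. This is an illustrative reading only: the class and method names (SitemapSourceSketch, locationsFor) are hypothetical, and the relative order of robots.txt sitemaps and path-based sitemaps is an assumption, not something the description states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class SitemapSourceSketch {

    // Default paths as listed in the class description.
    static final List<String> DEFAULT_PATHS =
            Arrays.asList("/sitemap.xml", "/sitemap_index.xml");

    // Picks the sitemap locations that would apply to one URL root.
    static List<String> locationsFor(String urlRoot,
            List<String> startSitemaps,
            List<String> robotsSitemaps,
            List<String> sitemapPaths) {
        List<String> locations = new ArrayList<>();
        // Start-URL sitemaps are the only ones resolved for their root.
        if (startSitemaps != null && !startSitemaps.isEmpty()) {
            locations.addAll(startSitemaps);
            return locations;
        }
        // Assumed order: robots.txt sitemaps first, then configured paths.
        if (robotsSitemaps != null) {
            locations.addAll(robotsSitemaps);
        }
        // Null or empty paths disable path-based lookups.
        if (sitemapPaths != null) {
            for (String path : sitemapPaths) {
                locations.add(urlRoot + path);
            }
        }
        return locations;
    }

    public static void main(String[] args) {
        System.out.println(locationsFor(
                "http://www.example.com", null, null, DEFAULT_PATHS));
    }
}
```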
-
-
Field Summary
Fields
Modifier and Type    Field    Description
static List<String>    DEFAULT_SITEMAP_PATHS
-
Constructor Summary
Constructors
Constructor    Description
GenericSitemapResolver()
-
Method Summary
All Methods    Instance Methods    Concrete Methods
Modifier and Type    Method    Description
boolean    equals(Object other)
List<String>    getSitemapPaths()    Gets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
Path    getTempDir()    Gets the directory where temporary sitemap files are written.
int    hashCode()
boolean    isLenient()
void    loadFromXML(XML xml)
protected void    onCrawlerCleanBegin(CrawlerEvent event)
protected void    onCrawlerEvent(CrawlerEvent event)
protected void    onCrawlerRunBegin(CrawlerEvent event)
protected void    onCrawlerStopBegin(CrawlerEvent event)
void    resolveSitemaps(HttpFetchClient fetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs)    Resolves the sitemap instructions for a URL "root" (e.g.
void    saveToXML(XML xml)
void    setLenient(boolean lenient)
void    setSitemapPaths(String... sitemapPaths)    Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
void    setSitemapPaths(List<String> sitemapPaths)    Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
void    setTempDir(Path tempDir)    Sets the directory where temporary sitemap files are written.
String    toString()
-
Methods inherited from class com.norconex.collector.core.crawler.CrawlerLifeCycleListener
accept, onCrawlerCleanEnd, onCrawlerInitBegin, onCrawlerInitEnd, onCrawlerRunEnd, onCrawlerRunThreadBegin, onCrawlerRunThreadEnd, onCrawlerShutdown, onCrawlerStopEnd
-
-
-
-
Method Detail
-
onCrawlerRunBegin
protected void onCrawlerRunBegin(CrawlerEvent event)
- Overrides:
onCrawlerRunBegin in class CrawlerLifeCycleListener
-
onCrawlerStopBegin
protected void onCrawlerStopBegin(CrawlerEvent event)
- Overrides:
onCrawlerStopBegin in class CrawlerLifeCycleListener
-
onCrawlerEvent
protected void onCrawlerEvent(CrawlerEvent event)
- Overrides:
onCrawlerEvent in class CrawlerLifeCycleListener
-
onCrawlerCleanBegin
protected void onCrawlerCleanBegin(CrawlerEvent event)
- Overrides:
onCrawlerCleanBegin in class CrawlerLifeCycleListener
-
getSitemapPaths
public List<String> getSitemapPaths()
Gets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. Default paths are "/sitemap.xml" and "/sitemap_index.xml".
- Returns:
- sitemap paths.
-
setSitemapPaths
public void setSitemapPaths(String... sitemapPaths)
Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
- Parameters:
sitemapPaths - sitemap paths.
-
setSitemapPaths
public void setSitemapPaths(List<String> sitemapPaths)
Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
- Parameters:
sitemapPaths - sitemap paths.
-
resolveSitemaps
public void resolveSitemaps(HttpFetchClient fetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs)
Description copied from interface: ISitemapResolver
Resolves the sitemap instructions for a URL "root" (e.g. http://www.example.com).
- Specified by:
resolveSitemaps in interface ISitemapResolver
- Parameters:
fetcher - the HTTP fetcher executor to use to stream Internet files if needed
urlRoot - the URL root for which to resolve the sitemap
sitemapLocations - sitemap locations to resolve
sitemapURLConsumer - where to store retrieved sitemap URLs
startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())
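The class description notes that sitemaps are resolved only once per URL root. A minimal sketch of that guard, using a thread-safe set and a Consumer callback, might look as follows. The class and method names are hypothetical, the consumer receives plain strings instead of HttpDocInfo, and fetching and parsing are replaced by a stand-in, so this only illustrates the "resolve once" bookkeeping.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch, for illustration only.
final class ResolveOnceSketch {

    // URL roots whose sitemaps were already handled.
    private final Set<String> resolvedRoots = ConcurrentHashMap.newKeySet();

    // Returns true if the root was resolved by this call, false if skipped.
    boolean resolveOnce(String urlRoot, List<String> sitemapLocations,
            Consumer<String> sitemapURLConsumer) {
        // Set.add returns false when the root is already present.
        if (!resolvedRoots.add(urlRoot)) {
            return false;
        }
        for (String location : sitemapLocations) {
            // Stand-in: a real resolver would fetch and parse the sitemap
            // and pass each URL it contains to the consumer.
            sitemapURLConsumer.accept(location);
        }
        return true;
    }
}
```

A second call with the same root returns false and leaves the consumer untouched.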
-
isLenient
public boolean isLenient()
-
setLenient
public void setLenient(boolean lenient)
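The lenient flag relates to the sub-directory rule from the class description: by default a sitemap applies only to URLs at or below its own directory. The check below is one possible reading of that rule, written as a plain string-prefix test. The names (SitemapScopeSketch, inScope) are hypothetical and the library's actual comparison may differ.

```java
// Hypothetical sketch, for illustration only.
final class SitemapScopeSketch {

    // Returns true when pageUrl may be listed by the sitemap at sitemapUrl.
    static boolean inScope(String sitemapUrl, String pageUrl, boolean lenient) {
        if (lenient) {
            // Lenient mode ignores the sub-directory restriction.
            return true;
        }
        // Directory of the sitemap: everything up to and including the last '/'.
        String sitemapDir =
                sitemapUrl.substring(0, sitemapUrl.lastIndexOf('/') + 1);
        return pageUrl.startsWith(sitemapDir);
    }

    public static void main(String[] args) {
        String sitemap = "http://www.example.com/catalog/sitemap.xml";
        System.out.println(inScope(sitemap,
                "http://www.example.com/catalog/item.html", false));
        System.out.println(inScope(sitemap,
                "http://www.example.com/news/today.html", false));
    }
}
```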
-
getTempDir
public Path getTempDir()
Gets the directory where temporary sitemap files are written.
- Returns:
- directory
-
setTempDir
public void setTempDir(Path tempDir)
Sets the directory where temporary sitemap files are written.
- Parameters:
tempDir - directory
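As the class description explains, sitemaps are stored in a local temporary file before being parsed. The following sketch shows one way to do that with java.nio.file; the class name, method name and file prefix are hypothetical and do not reflect how the resolver names or cleans up its files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, for illustration only.
final class SitemapTempFileSketch {

    // Writes sitemap content to a new temporary file under tempDir.
    static Path storeTemporarily(Path tempDir, String sitemapXml) {
        try {
            // Create the directory (and parents) if it does not exist yet.
            Files.createDirectories(tempDir);
            Path file = Files.createTempFile(tempDir, "sitemap-", ".xml");
            Files.write(file, sitemapXml.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would parse the returned file and delete it afterwards.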
-
loadFromXML
public void loadFromXML(XML xml)
- Specified by:
loadFromXML in interface IXMLConfigurable
-
saveToXML
public void saveToXML(XML xml)
- Specified by:
saveToXML in interface IXMLConfigurable
-
-