Class GenericSitemapResolver
- java.lang.Object
- com.norconex.collector.core.crawler.CrawlerLifeCycleListener
- com.norconex.collector.http.sitemap.impl.GenericSitemapResolver
- All Implemented Interfaces: ISitemapResolver, IEventListener<CrawlerEvent>, IXMLConfigurable, EventListener, Consumer<CrawlerEvent>
public class GenericSitemapResolver extends CrawlerLifeCycleListener implements ISitemapResolver, IXMLConfigurable
Implementation of ISitemapResolver as per the sitemap.xml standard defined at http://www.sitemaps.org/protocol.html.
Sitemaps are only resolved if they have not already been resolved for the same URL "root" (the protocol, host and possible port).
The Sitemap specification dictates that a sitemap.xml file defined in a sub-directory applies only to URLs found in that sub-directory and its children. This behavior is respected by default. Setting lenient to true lifts this restriction.
Paths relative to URL roots can be specified, and an attempt will be made to load and parse any sitemap found at those locations for each URL root encountered (except for "start URLs" sitemaps, see below). Default paths are /sitemap.xml and /sitemap_index.xml. Passing null or an empty path array to setSitemapPaths(String...) will prevent attempts to locate sitemaps, and only sitemaps found in robots.txt or defined as start URLs will be considered.
Sitemaps can be specified as "start URLs" (defined in HttpCrawlerConfig.getStartSitemapURLs()). Sitemaps defined that way will be the only ones resolved for the URL root they represent (sitemap paths or sitemaps defined in robots.txt won't apply).
Sitemaps are first stored in a local temporary file before being parsed. A directory relative to the crawler work directory will be created by default. To specify a custom directory, you can use setTempDir(Path).
- Since:
- 3.0.0 (merged from StandardSitemapResolver*)
- Author:
- Pascal Essiembre
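The URL "root" and sub-directory rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the resolver's actual code: the class SitemapScopeSketch and its methods urlRoot, sitemapDirectory and isInScope are hypothetical names, and scope is approximated here by a plain string-prefix test on the sitemap's directory.

```java
import java.net.URI;

// Hypothetical helper, not part of the Norconex API.
final class SitemapScopeSketch {

    // URL "root": protocol, host, and the port when one is explicitly given.
    static String urlRoot(String url) {
        URI uri = URI.create(url);
        String root = uri.getScheme() + "://" + uri.getHost();
        if (uri.getPort() != -1) {
            root += ":" + uri.getPort();
        }
        return root;
    }

    // Directory part of a sitemap URL, up to and including the last slash.
    // Assumption: the sitemap URL has no query string containing slashes.
    static String sitemapDirectory(String sitemapUrl) {
        return sitemapUrl.substring(0, sitemapUrl.lastIndexOf('/') + 1);
    }

    // Non-lenient: the page must live in the sitemap's directory or below.
    // Lenient: the directory restriction is not applied.
    static boolean isInScope(String sitemapUrl, String pageUrl, boolean lenient) {
        return lenient || pageUrl.startsWith(sitemapDirectory(sitemapUrl));
    }
}
```

With a sitemap at http://www.example.com/blog/sitemap.xml, a page under /blog/ is in scope in both modes, while http://www.example.com/about.html is only accepted when lenient is true.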
Field Summary
- static List<String> DEFAULT_SITEMAP_PATHS
Constructor Summary
- GenericSitemapResolver()
Method Summary
- boolean equals(Object other)
- List<String> getSitemapPaths(): Gets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
- Path getTempDir(): Gets the directory where temporary sitemap files are written.
- int hashCode()
- boolean isLenient()
- void loadFromXML(XML xml)
- protected void onCrawlerCleanBegin(CrawlerEvent event)
- protected void onCrawlerEvent(CrawlerEvent event)
- protected void onCrawlerRunBegin(CrawlerEvent event)
- protected void onCrawlerStopBegin(CrawlerEvent event)
- void resolveSitemaps(HttpFetchClient fetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs): Resolves the sitemap instructions for a URL "root" (e.g. http://www.example.com).
- void saveToXML(XML xml)
- void setLenient(boolean lenient)
- void setSitemapPaths(String... sitemapPaths): Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
- void setSitemapPaths(List<String> sitemapPaths): Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
- void setTempDir(Path tempDir): Sets the directory where temporary sitemap files are written.
- String toString()
Methods inherited from class com.norconex.collector.core.crawler.CrawlerLifeCycleListener
accept, onCrawlerCleanEnd, onCrawlerInitBegin, onCrawlerInitEnd, onCrawlerRunEnd, onCrawlerRunThreadBegin, onCrawlerRunThreadEnd, onCrawlerShutdown, onCrawlerStopEnd
Method Detail
-
onCrawlerRunBegin
protected void onCrawlerRunBegin(CrawlerEvent event)
- Overrides: onCrawlerRunBegin in class CrawlerLifeCycleListener
-
onCrawlerStopBegin
protected void onCrawlerStopBegin(CrawlerEvent event)
- Overrides: onCrawlerStopBegin in class CrawlerLifeCycleListener
-
onCrawlerEvent
protected void onCrawlerEvent(CrawlerEvent event)
- Overrides: onCrawlerEvent in class CrawlerLifeCycleListener
-
onCrawlerCleanBegin
protected void onCrawlerCleanBegin(CrawlerEvent event)
- Overrides: onCrawlerCleanBegin in class CrawlerLifeCycleListener
-
getSitemapPaths
public List<String> getSitemapPaths()
Gets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. Default paths are "/sitemap.xml" and "/sitemap_index.xml".
- Returns:
- sitemap paths.
-
setSitemapPaths
public void setSitemapPaths(String... sitemapPaths)
Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
- Parameters:
sitemapPaths - sitemap paths.
-
setSitemapPaths
public void setSitemapPaths(List<String> sitemapPaths)
Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
- Parameters:
sitemapPaths - sitemap paths.
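To illustrate how relative sitemap paths combine with a URL root, here is a minimal sketch. It is not the resolver's own code: SitemapPathsSketch, DEFAULT_PATHS and candidateUrls are hypothetical names, and the default values are copied from the class description. A null or empty path list yields no candidates, which mirrors the documented effect of disabling path-based lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the Norconex API.
final class SitemapPathsSketch {

    // Assumed to match the documented defaults.
    static final List<String> DEFAULT_PATHS = Collections.unmodifiableList(
            Arrays.asList("/sitemap.xml", "/sitemap_index.xml"));

    // Builds the absolute sitemap URLs to try for a given URL root.
    static List<String> candidateUrls(String urlRoot, List<String> paths) {
        if (paths == null || paths.isEmpty()) {
            // No paths configured: nothing to look up by path.
            return Collections.emptyList();
        }
        List<String> urls = new ArrayList<>();
        for (String path : paths) {
            // Tolerate paths given without a leading slash.
            urls.add(urlRoot + (path.startsWith("/") ? path : "/" + path));
        }
        return urls;
    }
}
```

For the root http://www.example.com and the default paths, the candidates are http://www.example.com/sitemap.xml and http://www.example.com/sitemap_index.xml.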
-
resolveSitemaps
public void resolveSitemaps(HttpFetchClient fetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs)
Description copied from interface: ISitemapResolver
Resolves the sitemap instructions for a URL "root" (e.g. http://www.example.com).
- Specified by: resolveSitemaps in interface ISitemapResolver
- Parameters:
fetcher - the HTTP fetcher executor to use to stream Internet files if needed
urlRoot - the URL root for which to resolve the sitemap
sitemapLocations - sitemap locations to resolve
sitemapURLConsumer - where to store retrieved sitemap URLs
startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())
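The selection rules stated in the class description (resolve once per URL root, and let start URL sitemaps take precedence over configured paths) might be modeled as below. This is only a sketch of the documented behavior, not the actual implementation: ResolveOnceSketch and locationsToResolve are hypothetical names, and the method returns the sitemap locations that would be fetched rather than invoking a consumer with the URLs found inside them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the documented rules, not the Norconex class.
final class ResolveOnceSketch {

    private final Set<String> resolvedRoots = new HashSet<>();
    private final List<String> sitemapPaths;

    ResolveOnceSketch(List<String> sitemapPaths) {
        this.sitemapPaths = sitemapPaths;
    }

    // Returns the sitemap locations to fetch for this URL root,
    // or an empty list if the root was already resolved.
    List<String> locationsToResolve(
            String urlRoot, List<String> sitemapLocations, boolean startURLs) {
        if (!resolvedRoots.add(urlRoot)) {
            return Collections.emptyList();
        }
        List<String> toResolve = sitemapLocations == null
                ? new ArrayList<>() : new ArrayList<>(sitemapLocations);
        // Configured paths are skipped when the locations are start URLs.
        if (!startURLs && sitemapPaths != null) {
            for (String path : sitemapPaths) {
                toResolve.add(urlRoot + path);
            }
        }
        return toResolve;
    }
}
```

A second call for the same root returns nothing, and a call flagged as start URLs returns only the locations it was given.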
-
isLenient
public boolean isLenient()
-
setLenient
public void setLenient(boolean lenient)
-
getTempDir
public Path getTempDir()
Gets the directory where temporary sitemap files are written.
- Returns:
- directory
-
setTempDir
public void setTempDir(Path tempDir)
Sets the directory where temporary sitemap files are written.
- Parameters:
tempDir - directory
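The "store in a temporary file, then parse" approach mentioned in the class description can be sketched with the standard library alone. This is an assumption-laden illustration, not the resolver's code: TempSitemapSketch and storeThenParse are hypothetical names, the parser is the JDK's StAX reader, and the sketch simply collects every loc element without distinguishing a sitemap from a sitemap index.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the Norconex API.
final class TempSitemapSketch {

    // Copies the sitemap content to a temporary file under tempDir,
    // parses that file, and returns the text of each <loc> element.
    static List<String> storeThenParse(InputStream content, Path tempDir) {
        Path tempFile = null;
        try {
            Files.createDirectories(tempDir);
            tempFile = Files.createTempFile(tempDir, "sitemap-", ".xml");
            Files.copy(content, tempFile, StandardCopyOption.REPLACE_EXISTING);

            List<String> locations = new ArrayList<>();
            try (InputStream in = Files.newInputStream(tempFile)) {
                XMLStreamReader reader =
                        XMLInputFactory.newInstance().createXMLStreamReader(in);
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "loc".equals(reader.getLocalName())) {
                        locations.add(reader.getElementText().trim());
                    }
                }
                reader.close();
            }
            return locations;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Invalid sitemap XML", e);
        } finally {
            // Always remove the temporary file, even when parsing fails.
            if (tempFile != null) {
                try {
                    Files.deleteIfExists(tempFile);
                } catch (IOException ignored) {
                    // Best-effort cleanup only.
                }
            }
        }
    }
}
```

Writing to disk first keeps memory use flat for large sitemaps, since the parser streams the file rather than holding the whole response in memory.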
-
loadFromXML
public void loadFromXML(XML xml)
- Specified by: loadFromXML in interface IXMLConfigurable
-
saveToXML
public void saveToXML(XML xml)
- Specified by: saveToXML in interface IXMLConfigurable