public class GenericSitemapResolver extends CrawlerLifeCycleListener implements ISitemapResolver, IXMLConfigurable
Implementation of ISitemapResolver as per the sitemap.xml standard
defined at http://www.sitemaps.org/protocol.html.
Sitemaps are only resolved if they have not been resolved already for the same URL "root" (the protocol, host and possible port).
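As an illustration of what a URL "root" is, the following minimal sketch derives one with the standard java.net.URI class. The class and method names (UrlRootSketch, urlRoot) are hypothetical and are not part of this API; the resolver's own derivation may differ in details.

```java
import java.net.URI;

class UrlRootSketch {

    /** Returns "scheme://host" plus ":port" when a port is explicitly given. */
    static String urlRoot(String url) {
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an absolute URL: " + url);
        }
        StringBuilder root = new StringBuilder()
                .append(uri.getScheme())
                .append("://")
                .append(uri.getHost());
        // getPort() returns -1 when the URL does not specify a port.
        if (uri.getPort() != -1) {
            root.append(':').append(uri.getPort());
        }
        return root.toString();
    }
}
```

With this sketch, two pages on the same protocol, host and port share one root, so their sitemaps would be looked up only once.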
The Sitemap specification dictates that a sitemap.xml file defined
in a sub-directory applies only to URLs found in that sub-directory and
its children. This behavior is respected by default. Setting lenient
to true lifts this restriction.
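The sub-directory rule can be pictured with a small sketch. It is only an approximation based on plain string prefixes, with hypothetical names (SitemapScopeSketch, inScope); it does not handle query strings or URL normalization and is not the resolver's actual code.

```java
class SitemapScopeSketch {

    /**
     * Tells whether pageUrl falls under the directory holding the sitemap.
     * When lenient is true, the location restriction is not applied.
     */
    static boolean inScope(String sitemapUrl, String pageUrl, boolean lenient) {
        if (lenient) {
            return true;
        }
        // Directory of the sitemap: everything up to and including the last '/'.
        String sitemapDir =
                sitemapUrl.substring(0, sitemapUrl.lastIndexOf('/') + 1);
        return pageUrl.startsWith(sitemapDir);
    }
}
```

For example, a sitemap at http://example.com/catalog/sitemap.xml would cover http://example.com/catalog/item/1.html but not http://example.com/news/1.html, unless lenient is true.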
Paths relative to URL roots can be specified, and an attempt will be made
to load and parse any sitemap found at those locations for each URL root
encountered (except for "start URLs" sitemaps, see below). Default paths
are /sitemap.xml and /sitemap_index.xml.
Setting null or an empty path array on setSitemapPaths(String...)
will prevent attempts to locate sitemaps, and only sitemaps found in
robots.txt or defined as start URLs will be considered.
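A rough sketch of how relative sitemap paths could combine with a URL root into candidate locations is shown here. The names (SitemapCandidatesSketch, candidates) are hypothetical, and treating a path without a leading slash as relative to the root is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class SitemapCandidatesSketch {

    /** Builds candidate sitemap URLs; null or no paths yields no candidates. */
    static List<String> candidates(String urlRoot, String... sitemapPaths) {
        List<String> urls = new ArrayList<>();
        if (sitemapPaths == null) {
            return urls;
        }
        for (String path : sitemapPaths) {
            // Assumption: tolerate paths given without a leading slash.
            urls.add(path.startsWith("/") ? urlRoot + path : urlRoot + "/" + path);
        }
        return urls;
    }
}
```

Called with the default paths and the root https://example.com, the sketch yields https://example.com/sitemap.xml and https://example.com/sitemap_index.xml; called with null or an empty array, it yields nothing, which mirrors the "no attempt to locate" behavior described above.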
Sitemaps can be specified as "start URLs" (defined in
HttpCrawlerConfig.getStartSitemapURLs()). Sitemaps defined
that way will be the only ones resolved for the root URL they represent
(sitemap paths or sitemaps defined in robots.txt won't apply).
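The precedence described above can be summarized in a short sketch. The names (SitemapSourceSketch, locationsToResolve) are hypothetical, and the ordering of robots.txt entries before path-based candidates is an assumption made only for the example.

```java
import java.util.ArrayList;
import java.util.List;

class SitemapSourceSketch {

    /**
     * Picks the sitemap locations to resolve for one URL root.
     * Start-URL sitemaps, when present, exclude every other source.
     */
    static List<String> locationsToResolve(
            List<String> startSitemaps,
            List<String> robotsSitemaps,
            List<String> pathSitemaps) {
        if (!startSitemaps.isEmpty()) {
            return new ArrayList<>(startSitemaps);
        }
        // Otherwise merge robots.txt and path-based locations, skipping duplicates.
        List<String> all = new ArrayList<>(robotsSitemaps);
        for (String location : pathSitemaps) {
            if (!all.contains(location)) {
                all.add(location);
            }
        }
        return all;
    }
}
```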
Sitemaps are first stored in a local temporary file before
being parsed. A directory relative to the crawler work directory
will be created by default. To specify a custom directory, you can use
setTempDir(Path).
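The "store first, then parse" approach can be sketched with JDK classes only. This is an assumption-laden illustration, not the resolver's implementation: the names (TempFileSitemapSketch, readLocs) are hypothetical, only loc elements are read, and sitemap indexes and compressed sitemaps are ignored.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TempFileSitemapSketch {

    /** Copies the sitemap stream to a temp file in tempDir, then reads its loc values. */
    static List<String> readLocs(InputStream sitemap, Path tempDir) {
        Path tmp = null;
        try {
            Files.createDirectories(tempDir);
            tmp = Files.createTempFile(tempDir, "sitemap-", ".xml");
            // The temp file already exists, so the copy must overwrite it.
            Files.copy(sitemap, tmp, StandardCopyOption.REPLACE_EXISTING);

            List<String> locs = new ArrayList<>();
            try (InputStream in = Files.newInputStream(tmp)) {
                XMLStreamReader xml =
                        XMLInputFactory.newInstance().createXMLStreamReader(in);
                while (xml.hasNext()) {
                    if (xml.next() == XMLStreamConstants.START_ELEMENT
                            && "loc".equals(xml.getLocalName())) {
                        locs.add(xml.getElementText().trim());
                    }
                }
                xml.close();
            }
            return locs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Invalid sitemap XML", e);
        } finally {
            if (tmp != null) {
                tmp.toFile().delete(); // best-effort cleanup of the temp file
            }
        }
    }
}
```

Writing to disk before parsing keeps memory use flat for large sitemaps, since the parser streams from the file rather than from a network connection that must stay open.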
Modifier and Type | Field and Description |
---|---|
static List<String> | DEFAULT_SITEMAP_PATHS |
Constructor and Description |
---|
GenericSitemapResolver() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
List<String> | getSitemapPaths() Gets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. |
Path | getTempDir() Gets the directory where temporary sitemap files are written. |
int | hashCode() |
boolean | isLenient() |
void | loadFromXML(XML xml) |
protected void | onCrawlerCleanBegin(CrawlerEvent event) |
protected void | onCrawlerEvent(CrawlerEvent event) |
protected void | onCrawlerRunBegin(CrawlerEvent event) |
protected void | onCrawlerStopBegin(CrawlerEvent event) |
void | resolveSitemaps(HttpFetchClient fetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs) Resolves the sitemap instructions for a URL "root" (e.g. |
void | saveToXML(XML xml) |
void | setLenient(boolean lenient) |
void | setSitemapPaths(List<String> sitemapPaths) Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. |
void | setSitemapPaths(String... sitemapPaths) Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. |
void | setTempDir(Path tempDir) Sets the directory where temporary sitemap files are written. |
String | toString() |
Methods inherited from class CrawlerLifeCycleListener: accept, onCrawlerCleanEnd, onCrawlerInitBegin, onCrawlerInitEnd, onCrawlerRunEnd, onCrawlerRunThreadBegin, onCrawlerRunThreadEnd, onCrawlerShutdown, onCrawlerStopEnd
protected void onCrawlerRunBegin(CrawlerEvent event)
Overrides: onCrawlerRunBegin in class CrawlerLifeCycleListener
protected void onCrawlerStopBegin(CrawlerEvent event)
Overrides: onCrawlerStopBegin in class CrawlerLifeCycleListener
protected void onCrawlerEvent(CrawlerEvent event)
Overrides: onCrawlerEvent in class CrawlerLifeCycleListener
protected void onCrawlerCleanBegin(CrawlerEvent event)
Overrides: onCrawlerCleanBegin in class CrawlerLifeCycleListener
public List<String> getSitemapPaths()
public void setSitemapPaths(String... sitemapPaths)
Parameters: sitemapPaths - sitemap paths.
public void setSitemapPaths(List<String> sitemapPaths)
Parameters: sitemapPaths - sitemap paths.
public void resolveSitemaps(HttpFetchClient fetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs)
Specified by: resolveSitemaps in interface ISitemapResolver
Parameters:
fetcher - the HTTP fetcher executor to use to stream Internet files if needed
urlRoot - the URL root for which to resolve the sitemap
sitemapLocations - sitemap locations to resolve
sitemapURLConsumer - where to store retrieved sitemap URLs
startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())
public boolean isLenient()
public void setLenient(boolean lenient)
public Path getTempDir()
public void setTempDir(Path tempDir)
Parameters: tempDir - directory
public void loadFromXML(XML xml)
Specified by: loadFromXML in interface IXMLConfigurable
public void saveToXML(XML xml)
Specified by: saveToXML in interface IXMLConfigurable
Copyright © 2009–2023 Norconex Inc. All rights reserved.