public class StandardSitemapResolver extends Object implements ISitemapResolver
Implementation of ISitemapResolver as per sitemap.xml standard
defined at
http://www.sitemaps.org/protocol.html.
Sitemaps are only resolved if they have not been resolved already for the same URL "root" (the protocol, host and possible port).
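
As a rough illustration of the URL "root" notion and the resolve-once rule above, here is a minimal sketch. The class `UrlRootSketch` and its methods are hypothetical and not part of the Norconex API; how the resolver actually tracks resolved roots is an assumption.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of the Norconex API. */
class UrlRootSketch {

    // Assumption: resolved roots are remembered in a simple in-memory set.
    private final Set<String> resolvedRoots = new HashSet<>();

    /** Returns protocol + host, plus the port when one is explicitly given. */
    static String urlRoot(String url) {
        URI uri = URI.create(url);
        StringBuilder root = new StringBuilder();
        root.append(uri.getScheme()).append("://").append(uri.getHost());
        if (uri.getPort() != -1) {
            root.append(':').append(uri.getPort());
        }
        return root.toString();
    }

    /** Returns true only the first time the URL's root is encountered. */
    boolean markForResolution(String url) {
        return resolvedRoots.add(urlRoot(url));
    }
}
```

With this sketch, `https://example.com/a.html` and `https://example.com/b/c.html` share the root `https://example.com`, so only the first of them would trigger sitemap resolution.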
The Sitemap specification dictates that a sitemap.xml file defined
in a sub-directory applies only to URLs found in that sub-directory and
its children. This behavior is respected by default. Setting lenient
to true lifts this restriction.
Since 2.9.1, setting lenient will also attempt to parse
XML values with invalid entities.
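
The sub-directory restriction can be pictured as a simple prefix check. The sketch below is an approximation under stated assumptions: `SitemapScopeSketch` and its methods are hypothetical names, and the resolver's actual comparison logic may differ.

```java
/** Hypothetical helper for illustration only; not part of the Norconex API. */
class SitemapScopeSketch {

    /** Directory portion of the sitemap URL, including the trailing slash. */
    static String scopeOf(String sitemapUrl) {
        return sitemapUrl.substring(0, sitemapUrl.lastIndexOf('/') + 1);
    }

    /**
     * Assumption: a URL is accepted when lenient is true, or when it
     * starts with the directory that holds the sitemap file.
     */
    static boolean isInScope(String sitemapUrl, String pageUrl, boolean lenient) {
        return lenient || pageUrl.startsWith(scopeOf(sitemapUrl));
    }
}
```

For a sitemap at `http://example.com/catalog/sitemap.xml`, the URL `http://example.com/catalog/shoes/1.html` is in scope, while `http://example.com/news/1.html` is accepted only when lenient is true.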
Paths relative to URL roots can be specified, and an attempt will be made
to load and parse any sitemap found at those locations for each URL root
encountered (except for "start URLs" sitemaps, see below). Default paths
are /sitemap.xml and /sitemap_index.xml.
Passing null or an empty path array to
setSitemapPaths(String...) prevents attempts to locate
sitemaps, so only sitemaps found in robots.txt or defined as start
URLs are considered.
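
To make the path handling concrete, the following sketch shows how candidate sitemap locations could be derived from a URL root and a set of relative paths. `SitemapPathsSketch` and `candidateLocations` are hypothetical names; only the two default paths come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the Norconex API. */
class SitemapPathsSketch {

    // The default paths stated in the class description.
    static final String[] DEFAULT_PATHS = {"/sitemap.xml", "/sitemap_index.xml"};

    /**
     * Joins each relative path to the URL root. A null or empty array
     * yields no candidates, mirroring the "do not try to locate" case.
     */
    static List<String> candidateLocations(String urlRoot, String... paths) {
        List<String> locations = new ArrayList<>();
        if (paths == null) {
            return locations;
        }
        for (String path : paths) {
            locations.add(urlRoot + path);
        }
        return locations;
    }
}
```

Calling `candidateLocations("http://example.com", SitemapPathsSketch.DEFAULT_PATHS)` gives `http://example.com/sitemap.xml` and `http://example.com/sitemap_index.xml`.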
Sitemaps can be specified as "start URLs" (defined in
HttpCrawlerConfig.getStartSitemapURLs()). Sitemaps defined
that way will be the only ones resolved for the root URL they represent
(sitemap paths or sitemaps defined in robots.txt won't apply).
Sitemaps are first stored in a local temporary file before
being parsed. The tempDir constructor argument is the
directory in which those files are stored. When null, the system
temporary directory is used, as returned by
FileUtils.getTempDirectoryPath().
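
The null fallback described above can be sketched with the standard library alone. `TempDirSketch` is a hypothetical name, and reading the `java.io.tmpdir` system property is used here as a stand-in for the Commons IO call, on the assumption that both point to the same directory.

```java
import java.io.File;

/** Hypothetical helper for illustration only; not part of the Norconex API. */
class TempDirSketch {

    /** Returns the given directory, or the system temp directory when null. */
    static File effectiveTempDir(File tempDir) {
        if (tempDir != null) {
            return tempDir;
        }
        // Assumption: equivalent to FileUtils.getTempDirectoryPath().
        return new File(System.getProperty("java.io.tmpdir"));
    }
}
```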

| Modifier and Type | Field and Description |
|---|---|
| static String[] | DEFAULT_SITEMAP_PATHS |

| Constructor and Description |
|---|
| StandardSitemapResolver(File tempDir, SitemapStore sitemapStore) |

| Modifier and Type | Method and Description |
|---|---|
| boolean | equals(Object other) |
| long | getFromDate(): Gets the minimum EPOCH date (in milliseconds) a sitemap entry should have to be considered. |
| String[] | getSitemapLocations(): Deprecated. Since 2.3.0, use HttpCrawlerConfig.getStartSitemapURLs() |
| String[] | getSitemapPaths(): Gets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. |
| File | getTempDir(): Gets the directory where temporary sitemap files are written. |
| int | hashCode() |
| boolean | isEscalateErrors(): Gets whether errors should be thrown instead of logged. |
| boolean | isLenient() |
| void | resolveSitemaps(org.apache.http.client.HttpClient httpClient, String urlRoot, String[] sitemapLocations, SitemapURLAdder sitemapURLAdder, boolean startURLs): Resolves the sitemap instructions for a URL "root" (e.g. |
| void | setEscalateErrors(boolean escalateErrors): Sets whether errors should be thrown instead of logged. |
| void | setFromDate(long fromDate): Sets the minimum EPOCH date (in milliseconds) a sitemap entry should have to be considered. |
| void | setLenient(boolean lenient) |
| void | setSitemapLocations(String... sitemapLocations): Deprecated. Since 2.3.0, use HttpCrawlerConfig.setStartSitemapURLs(String[]) |
| void | setSitemapPaths(String... sitemapPaths): Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. |
| void | setTempDir(File tempDir): Sets the directory where temporary sitemap files are written. |
| void | stop(): Stops any ongoing sitemap resolution. |
| String | toString() |

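
Since the from date is expressed in EPOCH milliseconds, a small conversion helper can be handy when preparing a value for setFromDate(long). The sketch below uses hypothetical names (`FromDateSketch`, `epochMillisUtc`, `isRecentEnough`). The inclusive comparison is an assumption, and how the resolver treats entries without a date is not covered here.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

/** Hypothetical helper for illustration only; not part of the Norconex API. */
class FromDateSketch {

    /** Converts a calendar date (midnight UTC) to EPOCH milliseconds. */
    static long epochMillisUtc(int year, int month, int day) {
        return LocalDate.of(year, month, day)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    /** Assumption: an entry qualifies when its date is at or after fromDate. */
    static boolean isRecentEnough(long entryDateMillis, long fromDate) {
        return entryDateMillis >= fromDate;
    }
}
```

For example, `epochMillisUtc(2021, 1, 1)` yields `1609459200000`, which could then be passed to setFromDate(long).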
public static final String[] DEFAULT_SITEMAP_PATHS

public StandardSitemapResolver(File tempDir, SitemapStore sitemapStore)

public String[] getSitemapPaths()

public void setSitemapPaths(String... sitemapPaths)
Parameters:
sitemapPaths - sitemap paths.

public void resolveSitemaps(org.apache.http.client.HttpClient httpClient,
String urlRoot,
String[] sitemapLocations,
SitemapURLAdder sitemapURLAdder,
boolean startURLs)
Specified by:
resolveSitemaps in interface ISitemapResolver
Parameters:
httpClient - the http client to use to stream Internet files if needed
urlRoot - the URL root for which to resolve the sitemap
sitemapLocations - sitemap locations to resolve
sitemapURLAdder - where to store retrieved site map URLs
startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())

@Deprecated
public String[] getSitemapLocations()
Deprecated. Since 2.3.0, use HttpCrawlerConfig.getStartSitemapURLs()

@Deprecated
public void setSitemapLocations(String... sitemapLocations)
Deprecated. Since 2.3.0, use HttpCrawlerConfig.setStartSitemapURLs(String[])
Parameters:
sitemapLocations - sitemap locations

public boolean isLenient()

public void setLenient(boolean lenient)

public long getFromDate()

public void setFromDate(long fromDate)
Parameters:
fromDate - from date

public boolean isEscalateErrors()
Returns:
true if throwing errors

public void setEscalateErrors(boolean escalateErrors)
Parameters:
escalateErrors - true if throwing errors

public File getTempDir()

public void setTempDir(File tempDir)
Parameters:
tempDir - directory

public void stop()
Specified by:
stop in interface ISitemapResolver

Copyright © 2009–2021 Norconex Inc. All rights reserved.