public class StandardSitemapResolver extends Object implements ISitemapResolver
Implementation of ISitemapResolver as per the sitemap.xml standard defined at http://www.sitemaps.org/protocol.html.
Sitemaps are only resolved if they have not been resolved already for the same URL "root" (the protocol, host and possible port).
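To illustrate the "once per URL root" rule, the sketch below derives a root (protocol, host and optional port) with java.net.URI and remembers which roots were already handled. The names UrlRootSketch, rootOf and markResolved are hypothetical and not part of the Norconex API; the resolver's actual bookkeeping may differ.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only, not part of the Norconex API.
final class UrlRootSketch {

    // Roots for which sitemap resolution was already attempted.
    private final Set<String> resolvedRoots = new HashSet<>();

    // Builds "protocol://host[:port]" from an absolute URL.
    static String rootOf(String url) {
        URI uri = URI.create(url);
        String root = uri.getScheme() + "://" + uri.getHost();
        // getPort() returns -1 when the URL carries no explicit port.
        return uri.getPort() == -1 ? root : root + ":" + uri.getPort();
    }

    // Returns true only the first time the root of the given URL is seen.
    boolean markResolved(String url) {
        return resolvedRoots.add(rootOf(url));
    }
}
```

With this sketch, two pages under the same host and port map to the same root, so only the first one would trigger a resolution attempt.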
The Sitemap specification dictates that a sitemap.xml file defined in a sub-directory applies only to URLs found in that sub-directory and its children. This behavior is respected by default. When lenient is set to true, this restriction is no longer enforced.
Since 2.9.1, setting lenient will also attempt to parse XML values with invalid entities.
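The sub-directory rule can be pictured as a simple prefix check. The following is a minimal sketch under that assumption; SitemapScopeSketch and inScope are hypothetical names, and the resolver's real check may handle more cases (for example, differences in scheme or letter case).

```java
// Hypothetical helper for illustration only, not part of the Norconex API.
final class SitemapScopeSketch {

    // Returns true when candidateUrl may be listed by the sitemap at sitemapUrl.
    static boolean inScope(String sitemapUrl, String candidateUrl, boolean lenient) {
        if (lenient) {
            // Lenient mode: the directory restriction is not applied.
            return true;
        }
        // Keep everything up to and including the last '/', i.e. the
        // directory that holds the sitemap file.
        String directory = sitemapUrl.substring(0, sitemapUrl.lastIndexOf('/') + 1);
        return candidateUrl.startsWith(directory);
    }
}
```

Under this sketch, a sitemap at http://example.com/catalog/sitemap.xml covers http://example.com/catalog/item.html but not http://example.com/about.html, unless lenient is true.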
Paths relative to URL roots can be specified, and an attempt will be made to load and parse any sitemap found at those locations for each URL root encountered (except for "start URLs" sitemaps, see below). Default paths are /sitemap.xml and /sitemap_index.xml.
Passing null or an empty path array to setSitemapPaths(String...) prevents attempts to locate sitemaps; only sitemaps found in robots.txt or defined as start URLs will then be considered.
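As a rough picture of how relative paths turn into candidate locations, the sketch below joins each path to a URL root and yields nothing when the paths are null or empty. SitemapPathsSketch and candidates are hypothetical names; only the two default path values come from this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not part of the Norconex API.
final class SitemapPathsSketch {

    // Default paths as documented on this page.
    static final String[] DEFAULT_PATHS = {"/sitemap.xml", "/sitemap_index.xml"};

    // Builds one candidate sitemap URL per path for the given URL root.
    static List<String> candidates(String urlRoot, String... paths) {
        List<String> locations = new ArrayList<>();
        if (paths == null) {
            // No paths configured: no location is guessed.
            return locations;
        }
        for (String path : paths) {
            locations.add(urlRoot + path);
        }
        return locations;
    }
}
```

Calling candidates("http://example.com", SitemapPathsSketch.DEFAULT_PATHS) gives the two default locations for that root, while a null or empty array gives an empty list.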
Sitemaps can be specified as "start URLs" (defined in HttpCrawlerConfig.getStartSitemapURLs()). Sitemaps defined that way will be the only ones resolved for the root URL they represent (sitemap paths or sitemaps defined in robots.txt won't apply).
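The precedence described above can be sketched as a small selection function. This is an illustration only: SitemapPrecedenceSketch and locationsToResolve are hypothetical names, and the ordering of robots.txt sitemaps before path-based candidates is an assumption, not something this page states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not part of the Norconex API.
final class SitemapPrecedenceSketch {

    // Picks the sitemap locations to resolve for one URL root.
    static List<String> locationsToResolve(
            String urlRoot,
            List<String> startSitemaps,   // start-URL sitemaps for this root
            List<String> robotsSitemaps,  // sitemaps listed in robots.txt
            List<String> sitemapPaths) {  // paths relative to the root
        if (!startSitemaps.isEmpty()) {
            // Start-URL sitemaps are exclusive for the root they represent.
            return new ArrayList<>(startSitemaps);
        }
        // Assumed order: robots.txt entries first, then path-based guesses.
        List<String> locations = new ArrayList<>(robotsSitemaps);
        for (String path : sitemapPaths) {
            String candidate = urlRoot + path;
            if (!locations.contains(candidate)) {
                locations.add(candidate);
            }
        }
        return locations;
    }
}
```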
Sitemaps are first stored in a local temporary file before being parsed. The tempDir constructor argument is the location where those files are stored. When null, the system temporary directory is used, as returned by FileUtils.getTempDirectoryPath().
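The null fallback for tempDir can be sketched with the JDK alone. The example reads the java.io.tmpdir system property, which is the value Commons IO's FileUtils.getTempDirectoryPath() is understood to return; TempDirSketch, effectiveTempDir, newSitemapTempFile and the "sitemap-" file prefix are hypothetical and chosen only for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only, not part of the Norconex API.
final class TempDirSketch {

    // Returns the configured directory, or the system temp directory when null.
    static File effectiveTempDir(File tempDir) {
        if (tempDir != null) {
            return tempDir;
        }
        return new File(System.getProperty("java.io.tmpdir"));
    }

    // Creates an empty file in which a downloaded sitemap could be stored
    // before parsing. The prefix and suffix are arbitrary choices.
    static Path newSitemapTempFile(File tempDir) throws IOException {
        Path dir = effectiveTempDir(tempDir).toPath();
        Files.createDirectories(dir);
        return Files.createTempFile(dir, "sitemap-", ".xml");
    }
}
```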
Modifier and Type | Field and Description |
---|---|
static String[] | DEFAULT_SITEMAP_PATHS |
Constructor and Description |
---|
StandardSitemapResolver(File tempDir, SitemapStore sitemapStore) |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
long | getFromDate() Gets the minimum EPOCH date (in milliseconds) a sitemap entry should have to be considered. |
String[] | getSitemapLocations() Deprecated. Since 2.3.0, use HttpCrawlerConfig.getStartSitemapURLs() |
String[] | getSitemapPaths() Gets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. |
File | getTempDir() Gets the directory where temporary sitemap files are written. |
int | hashCode() |
boolean | isEscalateErrors() Gets whether errors should be thrown instead of logged. |
boolean | isLenient() |
void | resolveSitemaps(org.apache.http.client.HttpClient httpClient, String urlRoot, String[] sitemapLocations, SitemapURLAdder sitemapURLAdder, boolean startURLs) Resolves the sitemap instructions for a URL "root" (e.g. |
void | setEscalateErrors(boolean escalateErrors) Sets whether errors should be thrown instead of logged. |
void | setFromDate(long fromDate) Sets the minimum EPOCH date (in milliseconds) a sitemap entry should have to be considered. |
void | setLenient(boolean lenient) |
void | setSitemapLocations(String... sitemapLocations) Deprecated. Since 2.3.0, use HttpCrawlerConfig.setStartSitemapURLs(String[]) |
void | setSitemapPaths(String... sitemapPaths) Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. |
void | setTempDir(File tempDir) Sets the directory where temporary sitemap files are written. |
void | stop() Stops any ongoing sitemap resolution. |
String | toString() |
public static final String[] DEFAULT_SITEMAP_PATHS
public StandardSitemapResolver(File tempDir, SitemapStore sitemapStore)
public String[] getSitemapPaths()
public void setSitemapPaths(String... sitemapPaths)
Parameters:
sitemapPaths - sitemap paths.
public void resolveSitemaps(org.apache.http.client.HttpClient httpClient, String urlRoot, String[] sitemapLocations, SitemapURLAdder sitemapURLAdder, boolean startURLs)
Specified by:
resolveSitemaps in interface ISitemapResolver
Parameters:
httpClient - the HTTP client to use to stream Internet files if needed
urlRoot - the URL root for which to resolve the sitemap
sitemapLocations - sitemap locations to resolve
sitemapURLAdder - where to store retrieved sitemap URLs
startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())
@Deprecated public String[] getSitemapLocations()
Deprecated. Since 2.3.0, use HttpCrawlerConfig.getStartSitemapURLs()
@Deprecated public void setSitemapLocations(String... sitemapLocations)
Deprecated. Since 2.3.0, use HttpCrawlerConfig.setStartSitemapURLs(String[])
Parameters:
sitemapLocations - sitemap locations
public boolean isLenient()
public void setLenient(boolean lenient)
public long getFromDate()
public void setFromDate(long fromDate)
Parameters:
fromDate - from date
public boolean isEscalateErrors()
Returns:
true if throwing errors
public void setEscalateErrors(boolean escalateErrors)
Parameters:
escalateErrors - true if throwing errors
public File getTempDir()
public void setTempDir(File tempDir)
Parameters:
tempDir - directory
public void stop()
Specified by:
stop in interface ISitemapResolver
Copyright © 2009–2021 Norconex Inc. All rights reserved.