public interface ISitemapResolver
Given a URL root, resolves the corresponding sitemap(s), if any, but only if they have not already been resolved during the current crawling session.
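The "resolve only once per crawling session" requirement can be illustrated with a small guard. This is a minimal sketch, not the Norconex implementation: `ResolvedRootTracker` and its methods are hypothetical names, and it assumes that a thread-safe in-memory set is enough to remember URL roots for one session.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of the Norconex API): remembers resolved URL roots. */
final class ResolvedRootTracker {

    // Thread-safe set, on the assumption that URLs are processed concurrently.
    private final Set<String> resolvedRoots = ConcurrentHashMap.newKeySet();

    /**
     * Records the URL root and reports whether it was new.
     * Returns true only the first time a given root is seen.
     */
    boolean markIfUnresolved(String urlRoot) {
        return resolvedRoots.add(urlRoot);
    }

    /** Forgets all roots, for example when a new crawling session starts. */
    void reset() {
        resolvedRoots.clear();
    }
}
```

An implementation could call `markIfUnresolved(urlRoot)` at the top of `resolveSitemaps` and return immediately when it yields `false`.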
Sitemap URLs can be specified as "start URLs" (defined in HttpCrawlerConfig.getStartSitemapURLs()). It is up to the selected implementation to decide whether to treat sitemaps specified as start URLs any differently.
When sitemaps are ignored (HttpCrawlerConfig.isIgnoreSitemap()), the selected sitemap resolver implementation is still invoked for sitemaps specified as start URLs.
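The interaction between the "ignore sitemap" setting and start URLs can be written as a single boolean rule. The sketch below is an interpretation of the paragraph above, not code taken from the crawler: `SitemapInvocationPolicy` and `shouldInvokeResolver` are hypothetical names, and the case "ignored and not a start URL" is assumed to mean the resolver is not invoked.

```java
/** Hypothetical helper (not part of the Norconex API). */
final class SitemapInvocationPolicy {

    private SitemapInvocationPolicy() {
    }

    /**
     * @param ignoreSitemap value of the "ignore sitemap" configuration flag
     * @param startURLs     whether the sitemap locations were given as start URLs
     * @return true if the resolver would be invoked under the assumed rule
     */
    static boolean shouldInvokeResolver(boolean ignoreSitemap, boolean startURLs) {
        // Start-URL sitemaps always reach the resolver; other sitemaps
        // reach it only when sitemaps are not ignored (assumption).
        return startURLs || !ignoreSitemap;
    }
}
```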
Sitemap locations to resolve may also come from a site's robots.txt (provided robots.txt files are not ignored).
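For illustration, sitemap locations are declared in robots.txt with `Sitemap:` lines. The following is a minimal sketch of how such lines could be collected before being passed as `sitemapLocations`. `RobotsSitemapExtractor` is a hypothetical class, not the crawler's own robots.txt parser, and it assumes that each directive sits on its own line with no trailing comment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the Norconex API). */
final class RobotsSitemapExtractor {

    private static final String PREFIX = "sitemap:";

    private RobotsSitemapExtractor() {
    }

    /** Returns the values of all "Sitemap:" lines, in order of appearance. */
    static List<String> extractSitemapLocations(String robotsTxt) {
        List<String> locations = new ArrayList<>();
        // \R matches any line break sequence (Java 8+).
        for (String line : robotsTxt.split("\\R")) {
            String trimmed = line.trim();
            // The directive name is matched case-insensitively.
            if (trimmed.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
                String location = trimmed.substring(PREFIX.length()).trim();
                if (!location.isEmpty()) {
                    locations.add(location);
                }
            }
        }
        return locations;
    }
}
```

The result can be converted with `locations.toArray(new String[0])` to match the `String[] sitemapLocations` parameter.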
It is possible for implementations not to attempt to resolve sitemaps for some URLs. Refer to the specific implementation for more details.
StandardSitemapResolver
Modifier and Type | Method and Description |
---|---|
void | resolveSitemaps(org.apache.http.client.HttpClient httpClient, String urlRoot, String[] sitemapLocations, SitemapURLAdder sitemapURLAdder, boolean startURLs) Resolves the sitemap instructions for a URL "root" (e.g. |
void | stop() Stops any ongoing sitemap resolution. |
void resolveSitemaps(org.apache.http.client.HttpClient httpClient, String urlRoot, String[] sitemapLocations, SitemapURLAdder sitemapURLAdder, boolean startURLs)
Parameters:
httpClient - the HTTP client to use to stream Internet files if needed
urlRoot - the URL root for which to resolve the sitemap
sitemapLocations - sitemap locations to resolve
sitemapURLAdder - where to store retrieved sitemap URLs
startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())
void stop()
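One common way to support a `stop()` that interrupts ongoing resolution is a shared flag checked between units of work. The sketch below shows that pattern only; it is not how any particular Norconex implementation works. `StoppableResolverSketch` is a hypothetical class with a deliberately simplified `resolveSitemaps` signature (no HTTP client or URL root), and a `Consumer<String>` stands in for `SitemapURLAdder`.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical sketch of cooperative stopping (not part of the Norconex API). */
final class StoppableResolverSketch {

    // Set by stop(); read before each sitemap location is processed.
    private final AtomicBoolean stopped = new AtomicBoolean(false);

    /** Simplified stand-in for resolveSitemaps: hands each location to the adder. */
    void resolveSitemaps(String[] sitemapLocations, Consumer<String> urlAdder) {
        for (String location : sitemapLocations) {
            if (stopped.get()) {
                break; // stop was requested, abandon the remaining locations
            }
            // A real resolver would fetch and parse the sitemap here.
            urlAdder.accept(location);
        }
    }

    /** Requests that any ongoing resolution ends at the next check. */
    void stop() {
        stopped.set(true);
    }
}
```

With this pattern, a stop request takes effect at the next location boundary rather than immediately, so a long download in progress would still run to completion unless the fetching code also checks the flag.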
Copyright © 2009–2021 Norconex Inc.. All rights reserved.