Interface ISitemapResolver
- All Known Implementing Classes:
GenericSitemapResolver
public interface ISitemapResolver
Given a URL root, resolves the corresponding sitemap(s), if any, and only if they have not yet been resolved for the crawling session.
Sitemap URLs can be specified as "start URLs" (defined in HttpCrawlerConfig.getStartSitemapURLs()). It is up to the selected implementation to decide whether to treat sitemaps specified as start URLs any differently.
When sitemaps are ignored (HttpCrawlerConfig.isIgnoreSitemap()), the selected sitemap resolver implementation will still be invoked for sitemaps specified as start URLs.
Sitemap locations to resolve may also come from a site's robots.txt (provided robots.txt files are not ignored).
Implementations may choose not to attempt to resolve sitemaps for some URLs. Refer to the specific implementation for more details.
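The "only if not yet resolved for a crawling session" rule can be illustrated with a minimal sketch. It is not the library's implementation: SimpleSitemapResolver and OncePerSessionResolver are hypothetical names, the signature is simplified to plain strings, and no sitemap is actually fetched or parsed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

class OncePerSessionSketch {

    // Hypothetical, simplified stand-in for the resolver contract
    // (the real interface takes an HttpFetchClient and HttpDocInfo).
    interface SimpleSitemapResolver {
        void resolveSitemaps(String urlRoot, List<String> sitemapLocations,
                Consumer<String> sitemapURLConsumer, boolean startURLs);
    }

    static class OncePerSessionResolver implements SimpleSitemapResolver {
        // URL roots already handled during this (assumed) crawl session.
        private final Set<String> resolvedRoots = ConcurrentHashMap.newKeySet();

        @Override
        public void resolveSitemaps(String urlRoot, List<String> sitemapLocations,
                Consumer<String> sitemapURLConsumer, boolean startURLs) {
            // Set.add returns false when the root was already recorded,
            // so a second call for the same root does nothing.
            if (!resolvedRoots.add(urlRoot)) {
                return;
            }
            for (String location : sitemapLocations) {
                // Assumption: a real implementation would fetch and parse
                // the sitemap here; this sketch only forwards the location.
                sitemapURLConsumer.accept(location);
            }
        }
    }

    // Calls the resolver twice for the same root and returns what was collected.
    static List<String> demo() {
        List<String> collected = new ArrayList<>();
        SimpleSitemapResolver resolver = new OncePerSessionResolver();
        List<String> locations = List.of("http://www.example.com/sitemap.xml");
        resolver.resolveSitemaps("http://www.example.com", locations, collected::add, false);
        resolver.resolveSitemaps("http://www.example.com", locations, collected::add, false);
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the second call is a no-op, so only one location reaches the consumer. Whether start-URL sitemaps should bypass such a check is left to the implementation, as the description above notes.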
- Author:
- Pascal Essiembre
Method Summary
Modifier and Type: void
Method: resolveSitemaps(HttpFetchClient httpFetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs)
Description: Resolves the sitemap instructions for a URL "root" (e.g. http://www.example.com).
Method Detail
resolveSitemaps
void resolveSitemaps(HttpFetchClient httpFetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs)
Resolves the sitemap instructions for a URL "root" (e.g. http://www.example.com).
Parameters:
httpFetcher - the http fetcher executor to use to stream Internet files if needed
urlRoot - the URL root for which to resolve the sitemap
sitemapLocations - sitemap locations to resolve
sitemapURLConsumer - where to store retrieved site map URLs
startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())
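As a rough illustration of how the five arguments fit together, the sketch below mirrors the shape of resolveSitemaps using stand-in types. FetcherStandIn, DocInfoStandIn, ResolverStandIn and LineBasedResolver are all hypothetical and are not part of the library; the toy "sitemap" is a newline-separated list of URLs rather than real sitemap XML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class ResolveSitemapsCallSketch {

    // Hypothetical stand-in for HttpFetchClient: returns a body for a URL.
    interface FetcherStandIn {
        String fetch(String url);
    }

    // Hypothetical stand-in for HttpDocInfo: only carries a URL reference.
    static class DocInfoStandIn {
        final String reference;
        DocInfoStandIn(String reference) {
            this.reference = reference;
        }
    }

    // Same parameter order as the documented method, with stand-in types.
    interface ResolverStandIn {
        void resolveSitemaps(FetcherStandIn httpFetcher, String urlRoot,
                List<String> sitemapLocations,
                Consumer<DocInfoStandIn> sitemapURLConsumer, boolean startURLs);
    }

    // Toy resolver: treats each fetched body as one URL per line.
    // This sketch ignores urlRoot and startURLs; an implementation may
    // choose to use them.
    static class LineBasedResolver implements ResolverStandIn {
        @Override
        public void resolveSitemaps(FetcherStandIn httpFetcher, String urlRoot,
                List<String> sitemapLocations,
                Consumer<DocInfoStandIn> sitemapURLConsumer, boolean startURLs) {
            for (String location : sitemapLocations) {
                String body = httpFetcher.fetch(location);
                for (String line : body.split("\n")) {
                    String url = line.trim();
                    if (!url.isEmpty()) {
                        // Hand each discovered URL to the caller's consumer.
                        sitemapURLConsumer.accept(new DocInfoStandIn(url));
                    }
                }
            }
        }
    }

    // Resolves one fake sitemap and returns the URLs passed to the consumer.
    static List<String> demo() {
        // Canned response instead of a network call.
        FetcherStandIn fetcher = url ->
                "http://www.example.com/a.html\nhttp://www.example.com/b.html\n";
        List<String> queued = new ArrayList<>();
        new LineBasedResolver().resolveSitemaps(
                fetcher,
                "http://www.example.com",
                List.of("http://www.example.com/sitemap.xml"),
                docInfo -> queued.add(docInfo.reference),
                true);
        return queued;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The consumer is where the caller decides what to do with each retrieved URL. Here it appends to a list, whereas a crawler would more likely queue the URL for processing.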