Interface ISitemapResolver

All Known Implementing Classes:
GenericSitemapResolver

public interface ISitemapResolver

Given a URL root, resolves the corresponding sitemap(s), if any, and only if they have not already been resolved during the current crawling session.

Sitemap URLs can be specified as "start URLs" (defined in HttpCrawlerConfig.getStartSitemapURLs()). It is up to the selected implementation to decide whether to treat sitemaps specified as start URLs any differently.

When sitemaps are ignored (HttpCrawlerConfig.isIgnoreSitemap()), the selected sitemap resolver implementation is still invoked for sitemaps specified as start URLs.

Sitemap locations to resolve may also come from a site's robots.txt (provided robots.txt files are not ignored).

It is possible for implementations not to attempt to resolve sitemaps for some URLs. Refer to the specific implementation for more details.
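The rules above (resolve each URL root at most once per crawling session, and still honor sitemaps given as start URLs when sitemaps are otherwise ignored) can be sketched as a small guard. This is only an illustration under stated assumptions: the class SitemapResolutionGuard and its method shouldResolve are hypothetical names, not part of the library, and an actual implementation such as GenericSitemapResolver may decide differently.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of the library: decides whether a URL root
// still needs its sitemap(s) resolved in the current crawling session.
final class SitemapResolutionGuard {

    // URL roots already handled in this session (thread-safe set,
    // since crawlers typically process URLs from several threads).
    private final Set<String> resolvedRoots = ConcurrentHashMap.newKeySet();

    // Assumed to mirror the value of HttpCrawlerConfig.isIgnoreSitemap().
    private final boolean ignoreSitemap;

    SitemapResolutionGuard(boolean ignoreSitemap) {
        this.ignoreSitemap = ignoreSitemap;
    }

    boolean shouldResolve(String urlRoot, boolean startURLs) {
        // When sitemaps are ignored, only sitemaps explicitly supplied
        // as start URLs are still resolved.
        if (ignoreSitemap && !startURLs) {
            return false;
        }
        // Set.add returns false when the root was already recorded,
        // so each root is resolved at most once per session.
        return resolvedRoots.add(urlRoot);
    }
}
```

An implementation could call such a guard at the top of resolveSitemaps and return early when it yields false.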

Author:
Pascal Essiembre
  • Method Details

    • resolveSitemaps

      void resolveSitemaps(HttpFetchClient httpFetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs)
      Resolves the sitemap instructions for a URL "root" (e.g. http://www.example.com).
      Parameters:
      httpFetcher - the HTTP fetcher to use to stream Internet files if needed
      urlRoot - the URL root for which to resolve the sitemap
      sitemapLocations - sitemap locations to resolve
      sitemapURLConsumer - consumer that receives the retrieved sitemap URLs
      startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())
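
To illustrate how the consumer parameter is meant to be used, here is a minimal, self-contained sketch. It does not use the real library types: FetchClientStub and DocInfoStub are local stand-ins for HttpFetchClient and HttpDocInfo, SitemapResolverSketch only mirrors the parameter shape of resolveSitemaps, and CannedSitemapResolver is a toy implementation that performs no network access. All of these names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Local stand-in for HttpFetchClient (hypothetical, does nothing).
final class FetchClientStub {}

// Local stand-in for HttpDocInfo (hypothetical): only carries a URL.
final class DocInfoStub {
    private final String reference;
    DocInfoStub(String reference) { this.reference = reference; }
    String getReference() { return reference; }
}

// Mirrors the parameter shape of ISitemapResolver.resolveSitemaps.
interface SitemapResolverSketch {
    void resolveSitemaps(FetchClientStub httpFetcher, String urlRoot,
            List<String> sitemapLocations,
            Consumer<DocInfoStub> sitemapURLConsumer, boolean startURLs);
}

// Toy resolver: instead of fetching and parsing each sitemap location,
// it pretends every sitemap lists a single page under the URL root.
final class CannedSitemapResolver implements SitemapResolverSketch {
    @Override
    public void resolveSitemaps(FetchClientStub httpFetcher, String urlRoot,
            List<String> sitemapLocations,
            Consumer<DocInfoStub> sitemapURLConsumer, boolean startURLs) {
        for (String location : sitemapLocations) {
            // A real implementation would stream "location" through the
            // fetcher and hand every URL found to the consumer.
            sitemapURLConsumer.accept(new DocInfoStub(urlRoot + "/index.html"));
        }
    }
}

final class ResolveSitemapsDemo {
    // The caller decides where URLs end up; here they go into a list.
    static List<String> collect() {
        List<String> found = new ArrayList<>();
        new CannedSitemapResolver().resolveSitemaps(
                new FetchClientStub(),
                "http://www.example.com",
                List.of("http://www.example.com/sitemap.xml"),
                doc -> found.add(doc.getReference()),
                true);
        return found;
    }
}
```

Passing a Consumer rather than returning a collection lets the caller process each URL as soon as it is found, which matters for large sitemaps.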