Interface ISitemapResolver

  • All Known Implementing Classes:
    GenericSitemapResolver

    public interface ISitemapResolver

    Given a URL root, resolves the corresponding sitemap(s), if any, and only if they have not yet been resolved for the current crawling session.

    Sitemap URLs can be specified as "start URLs" (defined in HttpCrawlerConfig.getStartSitemapURLs()). It is up to the selected implementation to decide whether to treat sitemaps specified as start URLs any differently.

    When sitemaps are ignored (HttpCrawlerConfig.isIgnoreSitemap()), the selected sitemap resolver implementation is still invoked for sitemaps specified as start URLs.

    Sitemap locations to resolve may also come from a site's robots.txt (provided robots.txt files are not ignored).

    It is possible for implementations not to attempt to resolve sitemaps for some URLs. Refer to the specific implementation for more details.
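
    The "only once per crawling session" rule described above can be pictured with a minimal sketch. It assumes a hypothetical stand-in class, OncePerSessionResolverSketch, that is not part of the documented API: it uses plain String values in place of HttpDocInfo and does no fetching. It only shows one way an implementation might track which URL roots were already handled.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical stand-in for illustration only. This is NOT ISitemapResolver
// or GenericSitemapResolver, and it performs no HTTP fetching or XML parsing.
class OncePerSessionResolverSketch {

    // URL roots already handled during the current crawling session.
    private final Set<String> resolvedRoots = ConcurrentHashMap.newKeySet();

    /**
     * Passes each sitemap location to the consumer the first time a URL root
     * is seen, and does nothing on later calls for the same root.
     *
     * @return true if the root was handled now, false if it was skipped
     */
    boolean resolveOnce(String urlRoot, List<String> sitemapLocations,
            Consumer<String> consumer) {
        // Set.add returns false when the root was already recorded.
        if (!resolvedRoots.add(urlRoot)) {
            return false;
        }
        sitemapLocations.forEach(consumer);
        return true;
    }
}
```

    A thread-safe set is used here on the assumption that several crawler threads may ask for the same URL root at once; a real implementation may track resolved roots differently.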

    Author:
    Pascal Essiembre
    • Method Detail

      • resolveSitemaps

        void resolveSitemaps(HttpFetchClient httpFetcher,
                             String urlRoot,
                             List<String> sitemapLocations,
                             Consumer<HttpDocInfo> sitemapURLConsumer,
                             boolean startURLs)
        Resolves the sitemap instructions for a URL "root" (e.g. http://www.example.com).
        Parameters:
        httpFetcher - the HTTP fetcher executor to use to stream Internet files if needed
        urlRoot - the URL root for which to resolve the sitemap
        sitemapLocations - sitemap locations to resolve
        sitemapURLConsumer - consumer that receives the URLs retrieved from the sitemap(s)
        startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())
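
        To illustrate how the consumer parameter is meant to be used, here is a simplified sketch. Everything in it is an assumption made for illustration: SitemapResolverSketch mirrors the method signature but drops the HttpFetchClient parameter and uses String in place of HttpDocInfo, and InMemoryResolverSketch reads sitemap contents from an in-memory map instead of fetching them.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical, simplified mirror of resolveSitemaps: String replaces
// HttpDocInfo and the HttpFetchClient parameter is omitted.
interface SitemapResolverSketch {
    void resolveSitemaps(String urlRoot, List<String> sitemapLocations,
            Consumer<String> sitemapURLConsumer, boolean startURLs);
}

// Hypothetical implementation backed by a map of sitemap location to the
// URLs it lists, standing in for downloading and parsing real sitemaps.
class InMemoryResolverSketch implements SitemapResolverSketch {

    private final Map<String, List<String>> sitemapContents;

    InMemoryResolverSketch(Map<String, List<String>> sitemapContents) {
        this.sitemapContents = sitemapContents;
    }

    @Override
    public void resolveSitemaps(String urlRoot, List<String> sitemapLocations,
            Consumer<String> sitemapURLConsumer, boolean startURLs) {
        // This sketch treats start URLs no differently from other sitemap
        // locations, so the startURLs flag is intentionally unused.
        for (String location : sitemapLocations) {
            // Unknown locations simply contribute no URLs.
            List<String> urls = sitemapContents.getOrDefault(
                    location, Collections.emptyList());
            for (String url : urls) {
                // Hand every discovered URL to the caller-supplied consumer.
                sitemapURLConsumer.accept(url);
            }
        }
    }
}
```

        A caller could pass a method reference such as queue::add as the consumer so that each discovered URL lands in its own collection.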