Class GenericSitemapResolver

  • All Implemented Interfaces:
    ISitemapResolver, IEventListener<CrawlerEvent>, IXMLConfigurable, EventListener, Consumer<CrawlerEvent>

    public class GenericSitemapResolver
    extends CrawlerLifeCycleListener
    implements ISitemapResolver, IXMLConfigurable

    Implementation of ISitemapResolver as per the sitemap.xml standard defined at http://www.sitemaps.org/protocol.html.

    Sitemaps are only resolved if they have not been resolved already for the same URL "root" (the protocol, host and possible port).
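
    As an illustration of the "URL root" notion above, the following is a minimal sketch using only the JDK. The class and method names (UrlRootSketch, urlRoot, shouldResolve) are hypothetical and are not part of this library; the actual resolver may compute and track roots differently.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the library: illustrates what a URL "root" is.
public class UrlRootSketch {

    // Returns protocol://host[:port] for an absolute URL.
    // The port is kept only when the URL states one explicitly.
    public static String urlRoot(String url) {
        URI uri = URI.create(url);
        StringBuilder root = new StringBuilder();
        root.append(uri.getScheme()).append("://").append(uri.getHost());
        if (uri.getPort() != -1) {
            root.append(':').append(uri.getPort());
        }
        return root.toString();
    }

    // Assumed bookkeeping: a root is resolved only the first time it is seen.
    // Set.add returns false when the root was already recorded.
    public static boolean shouldResolve(Set<String> resolvedRoots, String url) {
        return resolvedRoots.add(urlRoot(url));
    }

    public static void main(String[] args) {
        Set<String> resolvedRoots = new HashSet<>();
        // Same root twice: only the first call asks for resolution.
        System.out.println(shouldResolve(resolvedRoots, "https://www.example.com/a.html"));
        System.out.println(shouldResolve(resolvedRoots, "https://www.example.com/b/c.html"));
        // A different port makes a different root.
        System.out.println(shouldResolve(resolvedRoots, "https://www.example.com:8443/a.html"));
    }
}
```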

    The Sitemap specification dictates that a sitemap.xml file defined in a sub-directory applies only to URLs found in that sub-directory and its children. This behavior is respected by default. Setting lenient to true lifts this restriction.
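
    The sub-directory rule can be sketched as a simple prefix check. This is an assumption-laden illustration, not the library's code: SitemapScopeSketch, sitemapDirectory and isInScope are hypothetical names, and the sketch assumes sitemap URLs without a query string.

```java
// Hypothetical helper, not part of the library: illustrates the scope rule.
public class SitemapScopeSketch {

    // Directory of the sitemap: its URL up to and including the last '/'.
    // Assumes the sitemap URL carries no query string or fragment.
    public static String sitemapDirectory(String sitemapUrl) {
        return sitemapUrl.substring(0, sitemapUrl.lastIndexOf('/') + 1);
    }

    // In strict mode a URL is accepted only if it lives under the sitemap's
    // directory; in lenient mode the location check is skipped.
    public static boolean isInScope(
            String sitemapUrl, String candidateUrl, boolean lenient) {
        if (lenient) {
            return true;
        }
        return candidateUrl.startsWith(sitemapDirectory(sitemapUrl));
    }

    public static void main(String[] args) {
        String sitemap = "https://www.example.com/catalog/sitemap.xml";
        String inside = "https://www.example.com/catalog/shoes/1.html";
        String outside = "https://www.example.com/blog/post.html";
        System.out.println(isInScope(sitemap, inside, false));
        System.out.println(isInScope(sitemap, outside, false));
        System.out.println(isInScope(sitemap, outside, true));
    }
}
```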

    Paths relative to URL roots can be specified. For each URL root encountered, an attempt is made to load and parse any sitemap found at those locations (except for "start URLs" sitemaps, see below). Default paths are /sitemap.xml and /sitemap_index.xml. Passing null or an empty path array to setSitemapPaths(String...) prevents attempts to locate sitemaps, so that only sitemaps found in robots.txt or defined as start URLs are considered.

    Sitemaps can be specified as "start URLs" (defined in HttpCrawlerConfig.getStartSitemapURLs()). Sitemaps defined that way will be the only ones resolved for the root URL they represent (sitemap paths or sitemaps defined in robots.txt won't apply).
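
    The precedence described in the two paragraphs above can be sketched as follows. This is only an illustration under stated assumptions: SitemapLocationSketch and candidateLocations are hypothetical names, and the ordering (robots.txt entries before configured paths) is an assumption not confirmed by this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the library: illustrates which sitemap
// locations are considered for a given URL root.
public class SitemapLocationSketch {

    public static List<String> candidateLocations(
            String urlRoot,
            List<String> sitemapPaths,
            List<String> robotsTxtSitemaps,
            List<String> startSitemaps) {
        // Start URL sitemaps, when present, are the only ones used for the root.
        if (startSitemaps != null && !startSitemaps.isEmpty()) {
            return new ArrayList<>(startSitemaps);
        }
        List<String> locations = new ArrayList<>();
        if (robotsTxtSitemaps != null) {
            locations.addAll(robotsTxtSitemaps);
        }
        // Null or empty paths: nothing is appended, so no guessing of locations.
        if (sitemapPaths != null) {
            for (String path : sitemapPaths) {
                String location = urlRoot + path;
                if (!locations.contains(location)) {
                    locations.add(location);
                }
            }
        }
        return locations;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/sitemap.xml", "/sitemap_index.xml");
        List<String> none = Collections.emptyList();
        System.out.println(candidateLocations(
                "https://www.example.com", paths, none, none));
    }
}
```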

    Sitemaps are first stored in a local temporary file before being parsed. A directory relative to the crawler work directory will be created by default. To specify a custom directory, you can use setTempDir(Path).
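
    The "store locally, then parse" step can be pictured with the JDK alone. In this sketch, SitemapTempFileSketch, storeLocally, the "sitemap-" file prefix and the "sitemaps" directory name are all hypothetical; the actual file naming and directory layout used by the resolver are not documented on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the library: copies sitemap content to a
// temporary file inside a chosen directory before any parsing takes place.
public class SitemapTempFileSketch {

    public static Path storeLocally(InputStream sitemapContent, Path tempDir) {
        try {
            // Create the directory (and missing parents) if needed.
            Files.createDirectories(tempDir);
            Path tempFile = Files.createTempFile(tempDir, "sitemap-", ".xml");
            // createTempFile already created an empty file, so overwrite it.
            Files.copy(sitemapContent, tempFile,
                    StandardCopyOption.REPLACE_EXISTING);
            return tempFile;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a directory relative to the crawler work directory.
        Path tempDir = Paths.get(
                System.getProperty("java.io.tmpdir"), "sitemaps");
        byte[] xml = "<urlset></urlset>".getBytes(StandardCharsets.UTF_8);
        Path stored = storeLocally(new ByteArrayInputStream(xml), tempDir);
        System.out.println("Stored sitemap at " + stored);
        // A parser would read from 'stored' here; clean up afterwards.
        stored.toFile().delete();
    }
}
```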

    Since:
    3.0.0 (merged from StandardSitemapResolver*)
    Author:
    Pascal Essiembre
    • Field Detail

      • DEFAULT_SITEMAP_PATHS

        public static final List<String> DEFAULT_SITEMAP_PATHS
    • Constructor Detail

      • GenericSitemapResolver

        public GenericSitemapResolver()
    • Method Detail

      • getSitemapPaths

        public List<String> getSitemapPaths()
        Gets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. Default paths are "/sitemap.xml" and "/sitemap_index.xml".
        Returns:
        sitemap paths.
      • setSitemapPaths

        public void setSitemapPaths(String... sitemapPaths)
        Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
        Parameters:
        sitemapPaths - sitemap paths.
      • setSitemapPaths

        public void setSitemapPaths(List<String> sitemapPaths)
        Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
        Parameters:
        sitemapPaths - sitemap paths.
      • resolveSitemaps

        public void resolveSitemaps(HttpFetchClient fetcher,
                                    String urlRoot,
                                    List<String> sitemapLocations,
                                    Consumer<HttpDocInfo> sitemapURLConsumer,
                                    boolean startURLs)
        Description copied from interface: ISitemapResolver
        Resolves the sitemap instructions for a URL "root" (e.g. http://www.example.com).
        Specified by:
        resolveSitemaps in interface ISitemapResolver
        Parameters:
        fetcher - the HTTP fetcher executor to use to stream Internet files if needed
        urlRoot - the URL root for which to resolve the sitemap
        sitemapLocations - sitemap locations to resolve
        sitemapURLConsumer - where to store retrieved sitemap URLs
        startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())
      • isLenient

        public boolean isLenient()
      • setLenient

        public void setLenient(boolean lenient)
      • getTempDir

        public Path getTempDir()
        Gets the directory where temporary sitemap files are written.
        Returns:
        directory
      • setTempDir

        public void setTempDir(Path tempDir)
        Sets the directory where temporary sitemap files are written.
        Parameters:
        tempDir - directory
      • hashCode

        public int hashCode()
        Overrides:
        hashCode in class Object