Class GenericSitemapResolver

java.lang.Object
com.norconex.collector.core.crawler.CrawlerLifeCycleListener
com.norconex.collector.http.sitemap.impl.GenericSitemapResolver
All Implemented Interfaces:
ISitemapResolver, IEventListener<CrawlerEvent>, IXMLConfigurable, EventListener, Consumer<CrawlerEvent>

public class GenericSitemapResolver extends CrawlerLifeCycleListener implements ISitemapResolver, IXMLConfigurable

Implementation of ISitemapResolver as per sitemap.xml standard defined at http://www.sitemaps.org/protocol.html.

Sitemaps are only resolved if they have not been resolved already for the same URL "root" (the protocol, host and possible port).
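To make the notion of a URL "root" concrete, here is a minimal sketch that derives the protocol, host and optional port from an absolute URL using only java.net.URI. The class and method names (UrlRootSketch, urlRoot) are hypothetical and are not part of this library; the resolver's own derivation may differ.

```java
import java.net.URI;

// Hypothetical helper, for illustration only (not part of the Norconex API).
final class UrlRootSketch {

    // Returns scheme://host[:port], dropping the path, query and fragment.
    // Assumes an absolute URL with a host component.
    static String urlRoot(String url) {
        URI uri = URI.create(url);
        StringBuilder root = new StringBuilder()
                .append(uri.getScheme())
                .append("://")
                .append(uri.getHost());
        if (uri.getPort() != -1) {
            // Only an explicitly specified port is kept.
            root.append(':').append(uri.getPort());
        }
        return root.toString();
    }

    public static void main(String[] args) {
        // prints https://www.example.com:8443
        System.out.println(urlRoot("https://www.example.com:8443/docs/page.html?x=1"));
        // prints http://www.example.com
        System.out.println(urlRoot("http://www.example.com/a/b"));
    }
}
```

Under this sketch, two URLs that share the same root would trigger sitemap resolution only once.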

The Sitemap specification dictates that a sitemap.xml file defined in a sub-directory applies only to URLs found in that sub-directory and its children. This behavior is respected by default. Set lenient to true to lift this restriction.
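The sub-directory rule can be sketched as a simple prefix test. This is only an illustration under the assumption that scope is determined by the directory portion of the sitemap location; SitemapScopeSketch and its methods are hypothetical names, not this class's internals.

```java
// Hypothetical helper, for illustration only (not part of the Norconex API).
final class SitemapScopeSketch {

    // Directory portion of a sitemap location, e.g.
    // "http://www.example.com/blog/sitemap.xml" -> "http://www.example.com/blog/".
    // Assumes the sitemap URL contains a path.
    static String sitemapDirectory(String sitemapUrl) {
        return sitemapUrl.substring(0, sitemapUrl.lastIndexOf('/') + 1);
    }

    // In strict mode a URL is accepted only if it sits in the sitemap's
    // directory or below it; in lenient mode every URL is accepted.
    static boolean accepts(String sitemapUrl, String pageUrl, boolean lenient) {
        return lenient || pageUrl.startsWith(sitemapDirectory(sitemapUrl));
    }

    public static void main(String[] args) {
        String sitemap = "http://www.example.com/blog/sitemap.xml";
        // prints true
        System.out.println(accepts(sitemap, "http://www.example.com/blog/2020/post.html", false));
        // prints false
        System.out.println(accepts(sitemap, "http://www.example.com/about.html", false));
        // prints true
        System.out.println(accepts(sitemap, "http://www.example.com/about.html", true));
    }
}
```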

Paths relative to URL roots can be specified, and an attempt will be made to load and parse any sitemap found at those locations for each URL root encountered (except for "start URLs" sitemaps, see below). Default paths are /sitemap.xml and /sitemap_index.xml. Passing null or an empty path array to setSitemapPaths(String...) prevents attempts to locate sitemaps, so only sitemaps found in robots.txt or defined as start URLs are considered.
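As a rough illustration of how relative paths combine with a URL root, the sketch below builds the candidate sitemap locations to try. It assumes plain concatenation of root and path; SitemapCandidatesSketch and candidates are hypothetical names, and the resolver's actual lookup logic is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of the Norconex API).
final class SitemapCandidatesSketch {

    // Joins the URL root with each configured path. A null or empty
    // path array yields no candidates, so no lookup would be attempted.
    static List<String> candidates(String urlRoot, String... paths) {
        List<String> urls = new ArrayList<>();
        if (paths == null) {
            return urls;
        }
        for (String path : paths) {
            urls.add(urlRoot + path);
        }
        return urls;
    }

    public static void main(String[] args) {
        // prints [http://www.example.com/sitemap.xml, http://www.example.com/sitemap_index.xml]
        System.out.println(candidates(
                "http://www.example.com", "/sitemap.xml", "/sitemap_index.xml"));
        // prints []
        System.out.println(candidates("http://www.example.com", (String[]) null));
    }
}
```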

Sitemaps can be specified as "start URLs" (defined in HttpCrawlerConfig.getStartSitemapURLs()). Sitemaps defined that way will be the only ones resolved for the root URL they represent (sitemap paths or sitemaps defined in robots.txt won't apply).

Sitemaps are first stored in a local temporary file before being parsed. A directory relative to the crawler work directory will be created by default. To specify a custom directory, you can use setTempDir(Path).
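The store-then-parse approach can be sketched with the standard library alone: write the sitemap content to a temporary file in a chosen directory, then read the loc entries back from that file. Everything here (TempFileSitemapSketch, storeThenParse, the DOM-based parsing, the file prefix) is an assumption made for illustration and does not reflect how this class is implemented.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only (not part of the Norconex API).
final class TempFileSitemapSketch {

    // Stores the sitemap XML in a temporary file under tempDir, parses
    // that file, and returns the text of every <loc> element.
    static List<String> storeThenParse(String sitemapXml, Path tempDir) {
        try {
            Files.createDirectories(tempDir);
            Path tempFile = Files.createTempFile(tempDir, "sitemap-", ".xml");
            try {
                Files.write(tempFile, sitemapXml.getBytes(StandardCharsets.UTF_8));
                Document doc;
                try (InputStream in = Files.newInputStream(tempFile)) {
                    doc = DocumentBuilderFactory.newInstance()
                            .newDocumentBuilder().parse(in);
                }
                NodeList locs = doc.getElementsByTagName("loc");
                List<String> urls = new ArrayList<>();
                for (int i = 0; i < locs.getLength(); i++) {
                    urls.add(locs.item(i).getTextContent().trim());
                }
                return urls;
            } finally {
                // The temporary file is no longer needed once parsed.
                Files.deleteIfExists(tempFile);
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not store or parse sitemap", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://www.example.com/</loc></url>"
                + "<url><loc>http://www.example.com/about.html</loc></url>"
                + "</urlset>";
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "sitemap-sketch");
        System.out.println(storeThenParse(xml, dir));
    }
}
```

Writing to disk first keeps a large sitemap out of memory while it is being downloaded, at the cost of some temporary disk usage.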

Since:
3.0.0 (merged from StandardSitemapResolver*)
Author:
Pascal Essiembre
  • Field Details

    • DEFAULT_SITEMAP_PATHS

      public static final List<String> DEFAULT_SITEMAP_PATHS
  • Constructor Details

    • GenericSitemapResolver

      public GenericSitemapResolver()
  • Method Details

    • onCrawlerRunBegin

      protected void onCrawlerRunBegin(CrawlerEvent event)
      Overrides:
      onCrawlerRunBegin in class CrawlerLifeCycleListener
    • onCrawlerStopBegin

      protected void onCrawlerStopBegin(CrawlerEvent event)
      Overrides:
      onCrawlerStopBegin in class CrawlerLifeCycleListener
    • onCrawlerEvent

      protected void onCrawlerEvent(CrawlerEvent event)
      Overrides:
      onCrawlerEvent in class CrawlerLifeCycleListener
    • onCrawlerCleanBegin

      protected void onCrawlerCleanBegin(CrawlerEvent event)
      Overrides:
      onCrawlerCleanBegin in class CrawlerLifeCycleListener
    • getSitemapPaths

      public List<String> getSitemapPaths()
      Gets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps. Default paths are "/sitemap.xml" and "/sitemap_index.xml".
      Returns:
      sitemap paths.
    • setSitemapPaths

      public void setSitemapPaths(String... sitemapPaths)
      Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
      Parameters:
      sitemapPaths - sitemap paths.
    • setSitemapPaths

      public void setSitemapPaths(List<String> sitemapPaths)
      Sets the URL paths, relative to the URL root, from which to try to locate and resolve sitemaps.
      Parameters:
      sitemapPaths - sitemap paths.
    • resolveSitemaps

      public void resolveSitemaps(HttpFetchClient fetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs)
      Description copied from interface: ISitemapResolver
      Resolves the sitemap instructions for a URL "root" (e.g. http://www.example.com).
      Specified by:
      resolveSitemaps in interface ISitemapResolver
      Parameters:
      fetcher - the HTTP fetcher executor to use to stream Internet files if needed
      urlRoot - the URL root for which to resolve the sitemap
      sitemapLocations - sitemap locations to resolve
      sitemapURLConsumer - where to store retrieved sitemap URLs
      startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())
    • isLenient

      public boolean isLenient()
    • setLenient

      public void setLenient(boolean lenient)
    • getTempDir

      public Path getTempDir()
      Gets the directory where temporary sitemap files are written.
      Returns:
      directory
    • setTempDir

      public void setTempDir(Path tempDir)
      Sets the directory where temporary sitemap files are written.
      Parameters:
      tempDir - directory
    • loadFromXML

      public void loadFromXML(XML xml)
      Specified by:
      loadFromXML in interface IXMLConfigurable
    • saveToXML

      public void saveToXML(XML xml)
      Specified by:
      saveToXML in interface IXMLConfigurable
    • equals

      public boolean equals(Object other)
      Overrides:
      equals in class Object
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class Object
    • toString

      public String toString()
      Overrides:
      toString in class Object