Interface IRobotsTxtProvider

All Known Implementing Classes:
StandardRobotsTxtProvider

public interface IRobotsTxtProvider
Given a URL, extracts any "robots.txt" rules. Implementations are expected to cache each robots.txt instance, or the fact that none was found, for the duration of a crawl session so that no attempt is made to download it again.
Author:
Pascal Essiembre
  • Method Details

    • getRobotsTxt

      RobotsTxt getRobotsTxt(HttpFetchClient fetchClient, String url)
      Gets robots.txt rules. This method signature changed in 1.3 to include the userAgent.
      Parameters:
      fetchClient - the HTTP fetch client used to download robots.txt
      url - the URL from which the robots.txt location is derived
      Returns:
      robots.txt rules
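
The caching contract described for this interface can be illustrated with a minimal sketch. It is not the StandardRobotsTxtProvider implementation: the class name RobotsTxtCacheSketch, its methods, and the Function used in place of HttpFetchClient are all hypothetical, and rules are represented as a plain String rather than a RobotsTxt. It also assumes robots.txt sits at the root of the URL's scheme and authority.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch, not part of the documented API.
final class RobotsTxtCacheSketch {

    // robots.txt URL -> body; Optional.empty() records "none found".
    private final Map<String, Optional<String>> cache = new ConcurrentHashMap<>();

    // Stand-in for the fetch client: returns the body, or null if absent.
    private final Function<String, String> downloader;

    RobotsTxtCacheSketch(Function<String, String> downloader) {
        this.downloader = downloader;
    }

    // Assumes robots.txt lives at the root of the URL's scheme and authority.
    static String robotsTxtUrl(String url) {
        URI uri = URI.create(url);
        return uri.getScheme() + "://" + uri.getRawAuthority() + "/robots.txt";
    }

    // Downloads at most once per site; later calls are served from the cache,
    // including the case where no robots.txt was found.
    Optional<String> rulesFor(String url) {
        String robotsUrl = robotsTxtUrl(url);
        return cache.computeIfAbsent(robotsUrl,
                key -> Optional.ofNullable(downloader.apply(key)));
    }
}
```

Storing an empty Optional rather than skipping the entry is what lets a missing robots.txt be remembered, so a second URL on the same site does not trigger another download.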