Interface IRobotsTxtProvider

  • All Known Implementing Classes:
    StandardRobotsTxtProvider

    public interface IRobotsTxtProvider
    Given a URL, extracts any "robots.txt" rules. Implementations are expected to cache existing robots.txt instances, or cache the fact that none was found, for the duration of a crawl session so that no attempt is made to re-download it.
    Author:
    Pascal Essiembre
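    The sketch below illustrates one way to satisfy that caching expectation. It is a hypothetical decorator, not part of the API: it wraps another IRobotsTxtProvider (e.g. StandardRobotsTxtProvider), keeps one cache entry per site root for the crawl session, and also remembers when no robots.txt was found. The class name, cache layout, and Norconex package paths are assumptions.

        import java.net.URI;
        import java.util.Optional;
        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.ConcurrentMap;

        // Norconex package paths below are assumed, not confirmed by this page.
        import com.norconex.collector.http.fetch.HttpFetchClient;
        import com.norconex.collector.http.robot.IRobotsTxtProvider;
        import com.norconex.collector.http.robot.RobotsTxt;

        public class CachingRobotsTxtProvider implements IRobotsTxtProvider {

            private final IRobotsTxtProvider delegate;
            // One entry per "scheme://authority"; Optional.empty() records that
            // the delegate found no robots.txt for that site.
            private final ConcurrentMap<String, Optional<RobotsTxt>> cache =
                    new ConcurrentHashMap<>();

            public CachingRobotsTxtProvider(IRobotsTxtProvider delegate) {
                this.delegate = delegate;
            }

            @Override
            public RobotsTxt getRobotsTxt(HttpFetchClient fetchClient, String url) {
                return cache.computeIfAbsent(toSiteRoot(url),
                        k -> Optional.ofNullable(
                                delegate.getRobotsTxt(fetchClient, url)))
                        .orElse(null);
            }

            // Reduces a URL to "scheme://authority" so every page of a site
            // shares a single cache entry.
            private static String toSiteRoot(String url) {
                URI uri = URI.create(url);
                return uri.getScheme() + "://" + uri.getAuthority();
            }
        }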
    • Method Detail

      • getRobotsTxt

        RobotsTxt getRobotsTxt​(HttpFetchClient fetchClient,
                               String url)
        Gets robots.txt rules. This method signature changed in 1.3 to include the userAgent.
        Parameters:
        fetchClient - the HTTP fetch client used to download the robots.txt file
        url - the URL to derive the robots.txt from
        Returns:
        robots.txt rules
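
        For illustration only, a lookup might be wired as in the following sketch; the HttpFetchClient is assumed to be supplied by the crawler at runtime, and the StandardRobotsTxtProvider package path and no-argument constructor are assumptions not stated on this page.

            // Norconex package paths are assumed, not confirmed by this page.
            import com.norconex.collector.http.fetch.HttpFetchClient;
            import com.norconex.collector.http.robot.RobotsTxt;
            import com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider;

            public final class RobotsTxtLookupExample {

                // The crawler supplies the HttpFetchClient; it is received as a
                // parameter rather than constructed, since its creation is
                // internal to the collector.
                static RobotsTxt lookup(HttpFetchClient fetchClient, String pageUrl) {
                    return new StandardRobotsTxtProvider()
                            .getRobotsTxt(fetchClient, pageUrl);
                }
            }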