Class URLCrawlScopeStrategy


  • public class URLCrawlScopeStrategy
    extends Object

    By default, a crawler tries to follow every link it discovers. You can define your own filters to limit the scope of the pages being crawled. When multiple start URLs are defined, it can be tricky to apply global filtering to every start URL without causing URL filtering conflicts. This class offers an easy way to address a frequent URL filtering need: to "stay on site". That is, when extracting the URLs found on a page, keep only the URLs that are on the same site as the URL of that page.

    By default, this class does not restrict the crawler to a site: the "stay on" options (domain, port, and protocol) are all false.
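
    The "stay on site" idea can be pictured with a minimal sketch. It is not the library's implementation: the class StayOnSiteSketch and its public fields are hypothetical stand-ins for the three "stay on" options, and it assumes absolute URLs that java.net.URI can parse, with ports compared exactly as written in the URL.

```java
import java.net.URI;

/** Hypothetical illustration only; not the code of URLCrawlScopeStrategy. */
final class StayOnSiteSketch {

    // All options are off by default, so every candidate is accepted.
    boolean stayOnProtocol;
    boolean stayOnDomain;
    boolean stayOnPort;

    /** Returns true if candidateURL is on the same "site" as inScopeURL. */
    boolean isInScope(String inScopeURL, String candidateURL) {
        URI page = URI.create(inScopeURL);
        URI candidate = URI.create(candidateURL);

        if (stayOnProtocol && !same(page.getScheme(), candidate.getScheme())) {
            return false;
        }
        if (stayOnDomain && !same(page.getHost(), candidate.getHost())) {
            return false;
        }
        // Assumption: ports are compared as written. URI.getPort()
        // returns -1 when the URL does not specify a port.
        if (stayOnPort && page.getPort() != candidate.getPort()) {
            return false;
        }
        return true;
    }

    // Case-insensitive comparison that treats a missing value as a mismatch.
    private static boolean same(String a, String b) {
        return a != null && a.equalsIgnoreCase(b);
    }
}
```

    Because each candidate is compared with the page it was found on, rather than with one global pattern, the same settings work for any number of start URLs.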

    Since:
    2.3.0
    Author:
    Pascal Essiembre
    • Method Summary

      Modifier and Type Method Description
      boolean equals(Object other)
      int hashCode()
      boolean isIncludeSubdomains()
      Gets whether sub-domains are considered to be the same as a URL domain.
      boolean isInScope(String inScopeURL, String candidateURL)
      boolean isStayOnDomain()
      Whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL.
      boolean isStayOnPort()
      Gets whether the crawler should always stay on the same port as the port for each URL specified as a start URL.
      boolean isStayOnProtocol()
      Whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL.
      void setIncludeSubdomains(boolean includeSubdomains)
      Sets whether sub-domains are considered to be the same as a URL domain.
      void setStayOnDomain(boolean stayOnDomain)
      Sets whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL.
      void setStayOnPort(boolean stayOnPort)
      Sets whether the crawler should always stay on the same port as the port for each URL specified as a start URL.
      void setStayOnProtocol(boolean stayOnProtocol)
      Sets whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL.
      String toString()
    • Constructor Detail

      • URLCrawlScopeStrategy

        public URLCrawlScopeStrategy()
    • Method Detail

      • isStayOnDomain

        public boolean isStayOnDomain()
        Whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL. By default (false), the crawler will try to follow any discovered link not otherwise rejected by other settings (such as regular filtering rules you may have).
        Returns:
        true if the crawler should stay on a domain
      • setStayOnDomain

        public void setStayOnDomain(boolean stayOnDomain)
        Sets whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL.
        Parameters:
        stayOnDomain - true for the crawler to stay on domain
      • isIncludeSubdomains

        public boolean isIncludeSubdomains()
        Gets whether sub-domains are considered to be the same as a URL domain. Only applicable when "stayOnDomain" is true.
        Returns:
        true if including sub-domains
        Since:
        2.9.0
      • setIncludeSubdomains

        public void setIncludeSubdomains(boolean includeSubdomains)
        Sets whether sub-domains are considered to be the same as a URL domain. Only applicable when "stayOnDomain" is true.
        Parameters:
        includeSubdomains - true to include sub-domains
        Since:
        2.9.0
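
        One plausible reading of "sub-domains are considered to be the same as a URL domain" is sketched below. The helper SubdomainSketch.sameDomain is hypothetical, and the suffix rule it uses (the candidate host ends with a dot followed by the start URL host) is an assumption rather than documented behavior of this class.

```java
import java.util.Locale;

/** Hypothetical illustration only; not the code of URLCrawlScopeStrategy. */
final class SubdomainSketch {

    /**
     * Returns true if candidateHost equals startHost or, when
     * includeSubdomains is true, is a sub-domain of startHost.
     */
    static boolean sameDomain(
            String startHost, String candidateHost, boolean includeSubdomains) {
        // Host names are case-insensitive, so normalize both sides first.
        String start = startHost.toLowerCase(Locale.ROOT);
        String candidate = candidateHost.toLowerCase(Locale.ROOT);
        if (candidate.equals(start)) {
            return true;
        }
        // The leading dot keeps "notexample.com" from matching "example.com".
        return includeSubdomains && candidate.endsWith("." + start);
    }
}
```

        Under that assumption, a start URL on example.com would keep links to blog.example.com only when sub-domains are included.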
      • isStayOnPort

        public boolean isStayOnPort()
        Gets whether the crawler should always stay on the same port as the port for each URL specified as a start URL. By default (false), the crawler will try to follow any discovered link not otherwise rejected by other settings (such as regular filtering rules you may have).
        Returns:
        true if the crawler should stay on a port
      • setStayOnPort

        public void setStayOnPort(boolean stayOnPort)
        Sets whether the crawler should always stay on the same port as the port for each URL specified as a start URL.
        Parameters:
        stayOnPort - true for the crawler to stay on port
      • isStayOnProtocol

        public boolean isStayOnProtocol()
        Whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL. By default (false), the crawler will try to follow any discovered link not otherwise rejected by other settings (such as regular filtering rules you may have).
        Returns:
        true if the crawler should stay on protocol
      • setStayOnProtocol

        public void setStayOnProtocol(boolean stayOnProtocol)
        Sets whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL.
        Parameters:
        stayOnProtocol - true for the crawler to stay on protocol
      • isInScope

        public boolean isInScope(String inScopeURL,
                                 String candidateURL)
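
        This method carries no description here. Going by its parameter names and the class description, it presumably tests a candidate URL against a URL already known to be in scope, such as the page the candidate was extracted from. The sketch below shows how a two-argument check of that shape could be used to filter extracted links. ScopeFilterSketch, keepInScope, and sameHost are hypothetical names, and sameHost is only a stand-in for a real scope check.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical illustration only; not part of the documented API. */
final class ScopeFilterSketch {

    /** Keeps only the extracted URLs the check accepts for the given page. */
    static List<String> keepInScope(
            String pageURL,
            List<String> extractedURLs,
            BiPredicate<String, String> isInScope) {
        List<String> kept = new ArrayList<>();
        for (String candidate : extractedURLs) {
            // First argument: the in-scope page; second: the candidate link.
            if (isInScope.test(pageURL, candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    /** Stand-in scope check: accepts candidates on the same host. */
    static boolean sameHost(String inScopeURL, String candidateURL) {
        String pageHost = URI.create(inScopeURL).getHost();
        String candidateHost = URI.create(candidateURL).getHost();
        return pageHost != null && pageHost.equalsIgnoreCase(candidateHost);
    }
}
```

        With an instance of this class, a method reference to its isInScope method has the same (String, String) to boolean shape and could be passed as the predicate in place of the stand-in.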
      • hashCode

        public int hashCode()
        Overrides:
        hashCode in class Object