Class URLCrawlScopeStrategy

java.lang.Object
com.norconex.collector.http.crawler.URLCrawlScopeStrategy

public class URLCrawlScopeStrategy extends Object

By default, a crawler will try to follow all links it discovers. You can define your own filters to limit the scope of the pages being crawled. When multiple start URLs are defined, it can be tricky to apply global filtering to every URL without causing URL filtering conflicts. This class offers an easy way to address a frequent URL filtering need: to "stay on site". That is, when extracting URLs from a page, keep only the URLs that are on the same site as the URL of that page.

By default, this class does not require the crawler to stay on a site.

Since:
2.3.0
Author:
Pascal Essiembre
  • Constructor Summary

    Constructors
    Constructor
    Description
    URLCrawlScopeStrategy()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    boolean
    equals(Object other)
     
    int
    hashCode()
     
    boolean
    isIncludeSubdomains()
    Gets whether sub-domains are considered to be the same as a URL domain.
    boolean
    isInScope(String inScopeURL, String candidateURL)
     
    boolean
    isStayOnDomain()
    Whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL.
    boolean
    isStayOnPort()
    Gets whether the crawler should always stay on the same port as the port for each URL specified as a start URL.
    boolean
    isStayOnProtocol()
    Whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL.
    void
    setIncludeSubdomains(boolean includeSubdomains)
    Sets whether sub-domains are considered to be the same as a URL domain.
    void
    setStayOnDomain(boolean stayOnDomain)
    Sets whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL.
    void
    setStayOnPort(boolean stayOnPort)
    Sets whether the crawler should always stay on the same port as the port for each URL specified as a start URL.
    void
    setStayOnProtocol(boolean stayOnProtocol)
    Sets whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL.
    String
    toString()
     

    Methods inherited from class java.lang.Object

    clone, finalize, getClass, notify, notifyAll, wait, wait, wait
  • Constructor Details

    • URLCrawlScopeStrategy

      public URLCrawlScopeStrategy()
  • Method Details

    • isStayOnDomain

      public boolean isStayOnDomain()
      Whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL. By default (false) the crawler will try to follow any discovered links not otherwise rejected by other settings (like regular filtering rules you may have).
      Returns:
      true if the crawler should stay on a domain
    • setStayOnDomain

      public void setStayOnDomain(boolean stayOnDomain)
      Sets whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL.
      Parameters:
      stayOnDomain - true for the crawler to stay on domain
    • isIncludeSubdomains

      public boolean isIncludeSubdomains()
      Gets whether sub-domains are considered to be the same as a URL domain. Only applicable when "stayOnDomain" is true.
      Returns:
      true if including sub-domains
      Since:
      2.9.0
    • setIncludeSubdomains

      public void setIncludeSubdomains(boolean includeSubdomains)
      Sets whether sub-domains are considered to be the same as a URL domain. Only applicable when "stayOnDomain" is true.
      Parameters:
      includeSubdomains - true to include sub-domains
      Since:
      2.9.0
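
      To illustrate what treating sub-domains as the same domain could mean, here is a minimal, self-contained sketch. It is not this class's implementation: the class SubdomainScopeSketch and its methods isOnDomain and host are hypothetical names, and the suffix-matching rule is an assumption about how such a check might work.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch; not part of URLCrawlScopeStrategy.
public class SubdomainScopeSketch {

    // Returns true when the candidate host equals the in-scope host or,
    // if includeSubdomains is true, is a sub-domain of it (assumed rule).
    public static boolean isOnDomain(
            String inScopeURL, String candidateURL, boolean includeSubdomains) {
        String base = host(inScopeURL);
        String candidate = host(candidateURL);
        if (base == null || candidate == null) {
            return false; // no host to compare (e.g. a relative URL)
        }
        if (candidate.equals(base)) {
            return true;
        }
        // The leading dot keeps "notexample.com" from matching "example.com".
        return includeSubdomains && candidate.endsWith("." + base);
    }

    // Extracts the lower-cased host; URI.create throws
    // IllegalArgumentException if the string is not a valid URI.
    public static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? null : h.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String start = "https://example.com/";
        System.out.println(isOnDomain(start, "https://blog.example.com/post", true));
        System.out.println(isOnDomain(start, "https://blog.example.com/post", false));
        System.out.println(isOnDomain(start, "https://notexample.com/", true));
    }
}
```

      In this sketch the flag only widens the domain match, which mirrors the note above that the setting applies only when "stayOnDomain" is true.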
    • isStayOnPort

      public boolean isStayOnPort()
      Gets whether the crawler should always stay on the same port as the port for each URL specified as a start URL. By default (false) the crawler will try to follow any discovered links not otherwise rejected by other settings (like regular filtering rules you may have).
      Returns:
      true if the crawler should stay on a port
    • setStayOnPort

      public void setStayOnPort(boolean stayOnPort)
      Sets whether the crawler should always stay on the same port as the port for each URL specified as a start URL.
      Parameters:
      stayOnPort - true for the crawler to stay on port
    • isStayOnProtocol

      public boolean isStayOnProtocol()
      Whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL. By default (false) the crawler will try to follow any discovered links not otherwise rejected by other settings (like regular filtering rules you may have).
      Returns:
      true if the crawler should stay on protocol
    • setStayOnProtocol

      public void setStayOnProtocol(boolean stayOnProtocol)
      Sets whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL.
      Parameters:
      stayOnProtocol - true for the crawler to stay on protocol
    • isInScope

      public boolean isInScope(String inScopeURL, String candidateURL)
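
      No description is given for this method. Judging by its parameter names and the settings documented above, it presumably reports whether candidateURL is in scope relative to inScopeURL. The following self-contained sketch shows one way such a check could be written. It is not the library's implementation: the class ScopeCheckSketch, the flags passed as arguments, and the handling of default ports (80 for http, 443 for https) are all assumptions made for illustration.

```java
import java.net.URI;

// Hypothetical sketch; not the URLCrawlScopeStrategy implementation.
public class ScopeCheckSketch {

    // Applies only the checks whose flag is true; with all flags false,
    // every candidate is in scope (matching the documented defaults).
    public static boolean isInScope(String inScopeURL, String candidateURL,
            boolean stayOnProtocol, boolean stayOnDomain, boolean stayOnPort) {
        // URI.create throws IllegalArgumentException on an invalid URI.
        URI base = URI.create(inScopeURL);
        URI candidate = URI.create(candidateURL);
        if (stayOnProtocol
                && !sameIgnoreCase(base.getScheme(), candidate.getScheme())) {
            return false;
        }
        if (stayOnDomain
                && !sameIgnoreCase(base.getHost(), candidate.getHost())) {
            return false;
        }
        if (stayOnPort && effectivePort(base) != effectivePort(candidate)) {
            return false;
        }
        return true;
    }

    // Null-safe, case-insensitive comparison; a missing value never matches.
    static boolean sameIgnoreCase(String a, String b) {
        return a != null && a.equalsIgnoreCase(b);
    }

    // Assumption: an omitted port means the scheme's default port.
    static int effectivePort(URI uri) {
        if (uri.getPort() != -1) {
            return uri.getPort();
        }
        if ("https".equalsIgnoreCase(uri.getScheme())) {
            return 443;
        }
        if ("http".equalsIgnoreCase(uri.getScheme())) {
            return 80;
        }
        return -1;
    }

    public static void main(String[] args) {
        String start = "https://example.com/";
        System.out.println(
                isInScope(start, "https://example.com:443/a", true, true, true));
        System.out.println(
                isInScope(start, "http://example.com/a", true, true, true));
        System.out.println(
                isInScope(start, "https://other.org/", false, false, false));
    }
}
```

      Passing the three flags as arguments keeps the sketch stateless; the real class holds them as properties set through setStayOnProtocol, setStayOnDomain and setStayOnPort.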
    • equals

      public boolean equals(Object other)
      Overrides:
      equals in class Object
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class Object
    • toString

      public String toString()
      Overrides:
      toString in class Object