Class URLCrawlScopeStrategy
- java.lang.Object
-
- com.norconex.collector.http.crawler.URLCrawlScopeStrategy
-
public class URLCrawlScopeStrategy extends Object
By default, a crawler tries to follow every link it discovers. You can define your own filters to limit the scope of the pages being crawled. When multiple start URLs are defined, it can be tricky to apply global filtering to every URL without causing URL filtering conflicts. This class offers an easy way to address a frequent URL filtering need: to "stay on site". That is, when URLs are extracted from a page, only those on the same site as that page's URL are kept.
By default, this class does not restrict the crawler to a site.
- Since:
- 2.3.0
- Author:
- Pascal Essiembre
-
-
Constructor Summary
Constructors
Constructor, Description
- URLCrawlScopeStrategy()
-
Method Summary
Modifier and Type, Method, Description
- boolean equals(Object other)
- int hashCode()
- boolean isIncludeSubdomains(): Gets whether sub-domains are considered to be the same as a URL domain.
- boolean isInScope(String inScopeURL, String candidateURL)
- boolean isStayOnDomain(): Whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL.
- boolean isStayOnPort(): Gets whether the crawler should always stay on the same port as the port for each URL specified as a start URL.
- boolean isStayOnProtocol(): Whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL.
- void setIncludeSubdomains(boolean includeSubdomains): Sets whether sub-domains are considered to be the same as a URL domain.
- void setStayOnDomain(boolean stayOnDomain): Sets whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL.
- void setStayOnPort(boolean stayOnPort): Sets whether the crawler should always stay on the same port as the port for each URL specified as a start URL.
- void setStayOnProtocol(boolean stayOnProtocol): Sets whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL.
- String toString()
-
-
-
Method Detail
-
isStayOnDomain
public boolean isStayOnDomain()
Whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL. By default (false), the crawler will try to follow any discovered links not otherwise rejected by other settings (such as regular filtering rules you may have).
- Returns:
- true if the crawler should stay on a domain
-
setStayOnDomain
public void setStayOnDomain(boolean stayOnDomain)
Sets whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL.
- Parameters:
- stayOnDomain - true for the crawler to stay on domain
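To make the "stay on domain" idea concrete, here is a minimal, self-contained sketch of a host comparison between a start URL and a candidate URL. It is not the library's implementation: the class name `DomainScopeSketch` and the method `sameDomain` are hypothetical, and the sketch assumes a domain match means a case-insensitive match of the host part as parsed by `java.net.URI`.

```java
import java.net.URI;

// Hypothetical illustration only; not part of the Norconex API.
final class DomainScopeSketch {

    // Returns true when both URLs carry the same host name, ignoring case.
    // A URL without a host (e.g. a relative reference) never matches.
    static boolean sameDomain(String startURL, String candidateURL) {
        String startHost = URI.create(startURL).getHost();
        String candidateHost = URI.create(candidateURL).getHost();
        return startHost != null && startHost.equalsIgnoreCase(candidateHost);
    }

    public static void main(String[] args) {
        // Same host: the candidate would be kept.
        System.out.println(sameDomain(
                "http://example.com/index.html", "http://EXAMPLE.com/about.html"));
        // Different host: the candidate would be rejected.
        System.out.println(sameDomain(
                "http://example.com/index.html", "http://other.org/page.html"));
    }
}
```

Under this assumption, a sub-domain such as `blog.example.com` does not match `example.com`; whether it should is what the sub-domain setting controls.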
-
isIncludeSubdomains
public boolean isIncludeSubdomains()
Gets whether sub-domains are considered to be the same as a URL domain. Only applicable when "stayOnDomain" is true.
- Returns:
- true if including sub-domains
- Since:
- 2.9.0
-
setIncludeSubdomains
public void setIncludeSubdomains(boolean includeSubdomains)
Sets whether sub-domains are considered to be the same as a URL domain. Only applicable when "stayOnDomain" is true.
- Parameters:
- includeSubdomains - true to include sub-domains
- Since:
- 2.9.0
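One plausible reading of "sub-domains are considered to be the same as a URL domain" is a suffix match on the host name. The sketch below works under that assumption; the class `SubdomainScopeSketch` and its methods are hypothetical, and the library may define sub-domain matching differently (for example, relative to a registered domain rather than the full start host).

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration only; not part of the Norconex API.
final class SubdomainScopeSketch {

    // Extracts the lower-cased host, or null when the URL has none.
    static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? null : h.toLowerCase(Locale.ROOT);
    }

    // True when the candidate host equals the start host or ends with
    // "." + start host, i.e. is a sub-domain of it.
    static boolean sameDomainOrSubdomain(String startURL, String candidateURL) {
        String startHost = host(startURL);
        String candidateHost = host(candidateURL);
        if (startHost == null || candidateHost == null) {
            return false;
        }
        return candidateHost.equals(startHost)
                || candidateHost.endsWith("." + startHost);
    }
}
```

The leading dot in the suffix check matters: without it, `notexample.com` would wrongly count as a sub-domain of `example.com`.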
-
isStayOnPort
public boolean isStayOnPort()
Gets whether the crawler should always stay on the same port as the port for each URL specified as a start URL. By default (false), the crawler will try to follow any discovered links not otherwise rejected by other settings (such as regular filtering rules you may have).
- Returns:
- true if the crawler should stay on a port
-
setStayOnPort
public void setStayOnPort(boolean stayOnPort)
Sets whether the crawler should always stay on the same port as the port for each URL specified as a start URL.
- Parameters:
- stayOnPort - true for the crawler to stay on port
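Comparing ports raises the question of URLs that omit the port. The following sketch shows one way to handle it, assuming an omitted port stands for the scheme default (80 for http, 443 for https). That treatment is an assumption made for illustration, not documented behavior of this class, and `PortScopeSketch` with its methods is hypothetical.

```java
import java.net.URI;

// Hypothetical illustration only; not part of the Norconex API.
final class PortScopeSketch {

    // Returns the explicit port, or the scheme default when none is given.
    // Returns -1 for schemes this sketch does not know about.
    static int effectivePort(URI uri) {
        if (uri.getPort() != -1) {
            return uri.getPort();
        }
        String scheme = uri.getScheme();
        if ("http".equalsIgnoreCase(scheme)) {
            return 80;
        }
        if ("https".equalsIgnoreCase(scheme)) {
            return 443;
        }
        return -1;
    }

    // True when both URLs resolve to the same effective port.
    static boolean samePort(String startURL, String candidateURL) {
        return effectivePort(URI.create(startURL))
                == effectivePort(URI.create(candidateURL));
    }
}
```

With this reading, `http://example.com/` and `http://example.com:80/a` are on the same port, while `http://example.com:8080/a` is not.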
-
isStayOnProtocol
public boolean isStayOnProtocol()
Whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL. By default (false), the crawler will try to follow any discovered links not otherwise rejected by other settings (such as regular filtering rules you may have).
- Returns:
- true if the crawler should stay on protocol
-
setStayOnProtocol
public void setStayOnProtocol(boolean stayOnProtocol)
Sets whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL.
- Parameters:
- stayOnProtocol - true for the crawler to stay on protocol
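The effect of a flag that defaults to false can be illustrated with a small stand-in class. The sketch below is not the real `URLCrawlScopeStrategy`; `ProtocolScopeSketch` is a hypothetical model that borrows the method names `setStayOnProtocol` and `isInScope` from this page and assumes the protocol check is a case-insensitive comparison of URL schemes.

```java
import java.net.URI;

// Hypothetical stand-in; models only the protocol flag, not the real class.
final class ProtocolScopeSketch {

    // Off by default, so no candidate is rejected on protocol grounds.
    private boolean stayOnProtocol;

    void setStayOnProtocol(boolean stayOnProtocol) {
        this.stayOnProtocol = stayOnProtocol;
    }

    boolean isStayOnProtocol() {
        return stayOnProtocol;
    }

    // With the flag off, every candidate is in scope. With it on, the
    // candidate must use the same scheme as the in-scope URL.
    boolean isInScope(String inScopeURL, String candidateURL) {
        if (!stayOnProtocol) {
            return true;
        }
        String expected = URI.create(inScopeURL).getScheme();
        String actual = URI.create(candidateURL).getScheme();
        return expected != null && expected.equalsIgnoreCase(actual);
    }
}
```

In this model, a link from `http://example.com/` to `https://example.com/` is followed until the flag is switched on, after which only `http` links remain in scope.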
-
-