Class URLCrawlScopeStrategy
- java.lang.Object
-
- com.norconex.collector.http.crawler.URLCrawlScopeStrategy
-
public class URLCrawlScopeStrategy extends Object
By default a crawler will try to follow all links it discovers. You can define your own filters to limit the scope of the pages being crawled. When you have multiple URLs defined as start URLs, it can be tricky to perform global filtering that applies to every URL without causing URL filtering conflicts. This class offers an easy way to address a frequent URL filtering need: to "stay on site". That is, when following a page and extracting the URLs found in it, keep only the URLs that are on the same site as the URL of the page being processed.
By default this class does not request to stay on a site.
- Since:
- 2.3.0
- Author:
- Pascal Essiembre
-
-
Constructor Summary
Constructors
- URLCrawlScopeStrategy()
-
Method Summary
All Methods / Instance Methods / Concrete Methods
Modifier and Type, Method, Description:
- boolean equals(Object other)
- int hashCode()
- boolean isIncludeSubdomains(): Gets whether sub-domains are considered to be the same as a URL domain.
- boolean isInScope(String inScopeURL, String candidateURL)
- boolean isStayOnDomain(): Whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL.
- boolean isStayOnPort(): Gets whether the crawler should always stay on the same port as the port for each URL specified as a start URL.
- boolean isStayOnProtocol(): Whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL.
- void setIncludeSubdomains(boolean includeSubdomains): Sets whether sub-domains are considered to be the same as a URL domain.
- void setStayOnDomain(boolean stayOnDomain): Sets whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL.
- void setStayOnPort(boolean stayOnPort): Sets whether the crawler should always stay on the same port as the port for each URL specified as a start URL.
- void setStayOnProtocol(boolean stayOnProtocol): Sets whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL.
- String toString()
-
-
-
Method Detail
-
isStayOnDomain
public boolean isStayOnDomain()
Gets whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL. By default (false) the crawler will try to follow any discovered links not otherwise rejected by other settings (like regular filtering rules you may have).
- Returns:
- true if the crawler should stay on a domain
-
setStayOnDomain
public void setStayOnDomain(boolean stayOnDomain)
Sets whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL.
- Parameters:
- stayOnDomain - true for the crawler to stay on domain
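To make the "stay on domain" idea concrete, here is a minimal standalone sketch of a same-domain comparison. It is not the library's implementation: the class `StayOnDomainSketch` and the method `sameDomain` are hypothetical names, and comparing host names case-insensitively via `java.net.URI` is an assumption about what "same domain" means.

```java
import java.net.URI;

// Hypothetical illustration only; not part of the Norconex API.
class StayOnDomainSketch {

    // Returns true when both URLs have the same host name,
    // ignoring letter case. A URL without a host is never a match.
    static boolean sameDomain(String inScopeURL, String candidateURL) {
        String inScopeHost = URI.create(inScopeURL).getHost();
        String candidateHost = URI.create(candidateURL).getHost();
        return inScopeHost != null
                && inScopeHost.equalsIgnoreCase(candidateHost);
    }

    public static void main(String[] args) {
        // Same host, different path: kept.
        System.out.println(
                sameDomain("http://example.com/a", "http://example.com/b"));
        // Different host: rejected.
        System.out.println(
                sameDomain("http://example.com/a", "http://other.org/b"));
    }
}
```

In this sketch the first argument plays the role of the page URL already in scope and the second is a link extracted from that page, mirroring the parameter names of `isInScope(String inScopeURL, String candidateURL)`.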
-
isIncludeSubdomains
public boolean isIncludeSubdomains()
Gets whether sub-domains are considered to be the same as a URL domain. Only applicable when "stayOnDomain" is true.
- Returns:
- true if including sub-domains
- Since:
- 2.9.0
-
setIncludeSubdomains
public void setIncludeSubdomains(boolean includeSubdomains)
Sets whether sub-domains are considered to be the same as a URL domain. Only applicable when "stayOnDomain" is true.
- Parameters:
- includeSubdomains - true to include sub-domains
- Since:
- 2.9.0
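The effect of treating sub-domains as part of the same domain can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `IncludeSubdomainsSketch`, `inDomain`, and `host` are hypothetical names, and the suffix test (candidate host ends with a dot followed by the in-scope host) is one plausible reading of "sub-domain".

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration only; not part of the Norconex API.
class IncludeSubdomainsSketch {

    // Returns true when the candidate host equals the in-scope host or,
    // if includeSubdomains is set, is a sub-domain of it.
    static boolean inDomain(
            String inScopeURL, String candidateURL, boolean includeSubdomains) {
        String base = host(inScopeURL);
        String candidate = host(candidateURL);
        if (base == null || candidate == null) {
            return false;
        }
        if (candidate.equals(base)) {
            return true;
        }
        // The leading dot prevents "notexample.com" from matching "example.com".
        return includeSubdomains && candidate.endsWith("." + base);
    }

    // Extracts the lower-cased host name, or null if the URL has none.
    static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? null : h.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String start = "http://example.com/";
        String link = "http://blog.example.com/post";
        System.out.println(inDomain(start, link, false)); // prints "false"
        System.out.println(inDomain(start, link, true));  // prints "true"
    }
}
```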
-
isStayOnPort
public boolean isStayOnPort()
Gets whether the crawler should always stay on the same port as the port for each URL specified as a start URL. By default (false) the crawler will try to follow any discovered links not otherwise rejected by other settings (like regular filtering rules you may have).
- Returns:
- true if the crawler should stay on a port
-
setStayOnPort
public void setStayOnPort(boolean stayOnPort)
Sets whether the crawler should always stay on the same port as the port for each URL specified as a start URL.
- Parameters:
- stayOnPort - true for the crawler to stay on port
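A port comparison raises the question of URLs that omit the port. The sketch below assumes that a missing port falls back to the scheme default (80 for http, 443 for https); whether this class resolves default ports the same way is not stated in this page. `StayOnPortSketch`, `effectivePort`, and `samePort` are hypothetical names.

```java
import java.net.URI;

// Hypothetical illustration only; not part of the Norconex API.
class StayOnPortSketch {

    // Returns the explicit port, or the scheme default when none is given.
    // Returns -1 when the port is absent and the scheme is not http/https.
    static int effectivePort(String url) {
        URI uri = URI.create(url);
        if (uri.getPort() != -1) {
            return uri.getPort();
        }
        String scheme = uri.getScheme();
        if ("https".equalsIgnoreCase(scheme)) {
            return 443;
        }
        if ("http".equalsIgnoreCase(scheme)) {
            return 80;
        }
        return -1;
    }

    // Returns true when both URLs resolve to the same effective port.
    static boolean samePort(String inScopeURL, String candidateURL) {
        return effectivePort(inScopeURL) == effectivePort(candidateURL);
    }

    public static void main(String[] args) {
        // Implicit and explicit port 80 are treated as equal in this sketch.
        System.out.println(
                samePort("http://example.com/", "http://example.com:80/x"));
        // Port 8080 differs from the http default.
        System.out.println(
                samePort("http://example.com/", "http://example.com:8080/x"));
    }
}
```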
-
isStayOnProtocol
public boolean isStayOnProtocol()
Gets whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL. By default (false) the crawler will try to follow any discovered links not otherwise rejected by other settings (like regular filtering rules you may have).
- Returns:
- true if the crawler should stay on protocol
-
setStayOnProtocol
public void setStayOnProtocol(boolean stayOnProtocol)
Sets whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL.
- Parameters:
- stayOnProtocol - true for the crawler to stay on protocol
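The following sketch shows how a flag that defaults to false leaves every link in scope, and how enabling it restricts links to the same protocol. It is a simplified stand-in, not the library's implementation: `StayOnProtocolSketch` and its `inScope` method are hypothetical, and the protocol is taken to be the URI scheme.

```java
import java.net.URI;

// Hypothetical illustration only; not part of the Norconex API.
class StayOnProtocolSketch {

    // When stayOnProtocol is false, every candidate is accepted.
    // When true, the candidate must use the same scheme (e.g. http vs https).
    static boolean inScope(
            String inScopeURL, String candidateURL, boolean stayOnProtocol) {
        if (!stayOnProtocol) {
            return true;
        }
        String inScopeScheme = URI.create(inScopeURL).getScheme();
        String candidateScheme = URI.create(candidateURL).getScheme();
        return inScopeScheme != null
                && inScopeScheme.equalsIgnoreCase(candidateScheme);
    }

    public static void main(String[] args) {
        String page = "http://example.com/";
        String link = "https://example.com/login";
        System.out.println(inScope(page, link, false)); // prints "true"
        System.out.println(inScope(page, link, true));  // prints "false"
    }
}
```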
-
-