public class URLCrawlScopeStrategy extends Object
By default a crawler will try to follow all links it discovers. You can define your own filters to limit the scope of the pages being crawled. When multiple URLs are defined as start URLs, it can be tricky to apply global filtering to every URL without causing URL filtering conflicts. This class offers an easy way to address a frequent URL filtering need: to "stay on site". That is, when following a page and extracting the URLs found in it, keep only the URLs that are on the same site as the URL of the page being processed.
By default this class does not request to stay on a site.
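To make the "stay on site" idea concrete, the following is a minimal, self-contained sketch of how such a scope check could work. It is not the Norconex implementation: the class name StayOnSiteSketch, its fields, and the helper methods are hypothetical stand-ins that only mirror the four flags documented on this page. The default-port handling and the treatment of unparsable URLs as out of scope are assumptions as well.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical stand-in mirroring the documented flags; NOT the Norconex class. */
final class StayOnSiteSketch {
    boolean stayOnDomain;       // all flags default to false: no restriction
    boolean includeSubdomains;
    boolean stayOnPort;
    boolean stayOnProtocol;

    /** Returns true when candidateURL is acceptable relative to inScopeURL. */
    boolean isInScope(String inScopeURL, String candidateURL) {
        // Nothing requested: every candidate is in scope.
        if (!stayOnDomain && !stayOnPort && !stayOnProtocol) {
            return true;
        }
        URI base;
        URI candidate;
        try {
            base = URI.create(inScopeURL);
            candidate = URI.create(candidateURL);
        } catch (IllegalArgumentException e) {
            return false; // assumption: unparsable URLs are out of scope
        }
        if (stayOnProtocol && !sameIgnoreCase(base.getScheme(), candidate.getScheme())) {
            return false;
        }
        if (stayOnDomain && !sameDomain(base.getHost(), candidate.getHost())) {
            return false;
        }
        if (stayOnPort && effectivePort(base) != effectivePort(candidate)) {
            return false;
        }
        return true;
    }

    private static boolean sameIgnoreCase(String a, String b) {
        return a != null && a.equalsIgnoreCase(b);
    }

    private boolean sameDomain(String baseHost, String candidateHost) {
        if (baseHost == null || candidateHost == null) {
            return false;
        }
        String base = baseHost.toLowerCase(Locale.ROOT);
        String cand = candidateHost.toLowerCase(Locale.ROOT);
        if (cand.equals(base)) {
            return true;
        }
        // Sub-domains count as the same domain only when explicitly enabled.
        return includeSubdomains && cand.endsWith("." + base);
    }

    /** Assumption: a missing port falls back to 80 (http) or 443 (https). */
    private static int effectivePort(URI uri) {
        if (uri.getPort() != -1) {
            return uri.getPort();
        }
        if ("https".equalsIgnoreCase(uri.getScheme())) {
            return 443;
        }
        return "http".equalsIgnoreCase(uri.getScheme()) ? 80 : -1;
    }

    public static void main(String[] args) {
        StayOnSiteSketch scope = new StayOnSiteSketch();
        scope.stayOnDomain = true;
        System.out.println(scope.isInScope("http://example.com/a", "http://example.com/b"));
        System.out.println(scope.isInScope("http://example.com/a", "http://other.org/b"));
    }
}
```

Because each candidate is compared against the URL of the page it was found on, rather than against one global pattern, several start URLs on different sites can each stay on their own site without conflicting filters.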
Constructor and Description |
---|
URLCrawlScopeStrategy() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
int | hashCode() |
boolean | isIncludeSubdomains(): Gets whether sub-domains are considered to be the same as a URL domain. |
boolean | isInScope(String inScopeURL, String candidateURL) |
boolean | isStayOnDomain(): Gets whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL. |
boolean | isStayOnPort(): Gets whether the crawler should always stay on the same port as the port for each URL specified as a start URL. |
boolean | isStayOnProtocol(): Gets whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL. |
void | setIncludeSubdomains(boolean includeSubdomains): Sets whether sub-domains are considered to be the same as a URL domain. |
void | setStayOnDomain(boolean stayOnDomain): Sets whether the crawler should always stay on the same domain name as the domain for each URL specified as a start URL. |
void | setStayOnPort(boolean stayOnPort): Sets whether the crawler should always stay on the same port as the port for each URL specified as a start URL. |
void | setStayOnProtocol(boolean stayOnProtocol): Sets whether the crawler should always stay on the same protocol as the protocol for each URL specified as a start URL. |
String | toString() |
public boolean isStayOnDomain()
Returns: true if the crawler should stay on a domain

public void setStayOnDomain(boolean stayOnDomain)
Parameters: stayOnDomain - true for the crawler to stay on domain

public boolean isIncludeSubdomains()
Returns: true if including sub-domains

public void setIncludeSubdomains(boolean includeSubdomains)
Parameters: includeSubdomains - true to include sub-domains

public boolean isStayOnPort()
Returns: true if the crawler should stay on a port

public void setStayOnPort(boolean stayOnPort)
Parameters: stayOnPort - true for the crawler to stay on port

public boolean isStayOnProtocol()
Returns: true if the crawler should stay on protocol

public void setStayOnProtocol(boolean stayOnProtocol)
Parameters: stayOnProtocol - true for the crawler to stay on protocol

Copyright © 2009–2021 Norconex Inc. All rights reserved.