public class HttpCrawlerConfig extends CrawlerConfig
HTTP Crawler configuration.
Crawling begins with one or more "start" URLs. Start URLs can be defined in a combination of ways:
- Listing URLs directly (setStartURLs(List)).
- Pointing to one or more local files containing start URLs, one per line (setStartURLsFiles(List)).
- Pointing to one or more sitemap URLs (setStartSitemapURLs(List)).
- Supplying one or more IStartURLsProvider to dynamically provide a list of start URLs (setStartURLsProviders(List)).
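For example, a minimal Java configuration combining a few of these options could look like this (a sketch; package locations are assumed from the v3 API and the URLs are placeholders):

    import java.nio.file.Paths;
    import com.norconex.collector.http.crawler.HttpCrawlerConfig;

    HttpCrawlerConfig cfg = new HttpCrawlerConfig();
    cfg.setStartURLs("https://example.com/");                   // listed directly
    cfg.setStartURLsFiles(Paths.get("./seed-urls.txt"));        // one URL per line
    cfg.setStartSitemapURLs("https://example.com/sitemap.xml"); // sitemap as seed

The cfg instance is reused in the sketches further below.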
Scope: To limit crawling to specific web domains and avoid creating many filters to that effect, you can tell the crawler to "stay" within the web site "scope" with setUrlCrawlScopeStrategy(URLCrawlScopeStrategy).
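For instance, to stay on the start URLs' domain and its subdomains, a sketch could be (the URLCrawlScopeStrategy setters shown are assumptions mirroring the stayOnDomain and includeSubdomains XML attributes documented below):

    URLCrawlScopeStrategy scope = new URLCrawlScopeStrategy();
    scope.setStayOnDomain(true);       // assumed setter backing stayOnDomain
    scope.setIncludeSubdomains(true);  // assumed setter backing includeSubdomains
    cfg.setUrlCrawlScopeStrategy(scope);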
Pages on web sites are often referenced using different URL patterns. Such URL variations can fool the crawler into downloading the same document multiple times. To avoid this, URLs are "normalized", that is, converted so they are always formulated the same way. By default, the crawler only applies normalization in ways that are semantically equivalent (see GenericURLNormalizer).
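To replace or tune the normalizer, a minimal sketch (assuming GenericURLNormalizer's default constructor applies its default, semantically equivalent normalizations):

    GenericURLNormalizer normalizer = new GenericURLNormalizer();
    // Additional normalizations could be configured here.
    cfg.setUrlNormalizer(normalizer);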
Be kind to web sites you crawl. Being too aggressive can be perceived as a cyber-attack by the targeted web site (e.g., a DoS attack), which can lead to your crawler being blocked.
For this reason, the crawler plays nice by default: it waits a few seconds between each page download, regardless of the maximum number of threads specified or whether pages crawled are on different web sites. This can of course be changed to be as fast as you want. See GenericDelayResolver for changing the default options. You can also provide your own "delay resolver" by supplying a class implementing IDelayResolver.
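For example, to shorten the default delay (a sketch; setDefaultDelay is assumed to take a delay in milliseconds, as in the v3 API):

    GenericDelayResolver delay = new GenericDelayResolver();
    delay.setDefaultDelay(1000); // wait 1 second between page downloads
    cfg.setDelayResolver(delay);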
The crawl depth represents how many levels from the start URL the crawler goes. From a browser user perspective, it can be seen as the number of link "clicks" required from a start URL in order to get to a specific page. The crawler will keep going deeper for as long as it discovers new URLs that are not rejected by your configuration. This is not always desirable. For instance, a web site could have dynamically generated URLs with infinite possibilities (e.g., dynamically generated web calendars). To avoid infinite crawls, it is recommended to limit the maximum depth to something reasonable for your site with setMaxDepth(int).
Downloaded files are deleted after being processed. Set setKeepDownloads(boolean) to true in order to preserve them. Files will be kept under a new "downloads" folder found under your working directory. Keep in mind this is not a method for cloning a site. Use with caution on large sites as it can quickly fill up the local disk space.
By default the crawler stores, as metadata, URLs extracted from documents that are in scope. Exceptions are pages discovered at the configured maximum depth (setMaxDepth(int)). This can be changed using the setKeepReferencedLinks(Set) method. Changing this setting has no effect on which pages get crawled. Possible options are:
- INSCOPE: keep "in-scope" URLs, stored under HttpDocMetadata.REFERENCED_URLS (the default).
- OUTSCOPE: keep "out-of-scope" URLs, stored under HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE.
- MAXDEPTH: also keep URLs extracted from pages at the maximum depth.
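For example, to keep both in-scope and out-of-scope links as metadata (methods and enum values as listed on this page):

    cfg.setKeepReferencedLinks(
            HttpCrawlerConfig.ReferencedLinkType.INSCOPE,
            HttpCrawlerConfig.ReferencedLinkType.OUTSCOPE);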
Orphans are valid documents which, on subsequent crawls, can no longer be reached (e.g., they are no longer referenced). This is regardless of whether the file has been deleted or not at the source. You can tell the crawler how to handle those with CrawlerConfig.setOrphansStrategy(OrphansStrategy). Possible options are PROCESS (the default, orphans are re-processed normally), IGNORE (orphans are left alone), and DELETE (deletion requests are sent to your Committer(s)).
By default the crawler logs exceptions while trying to prevent them from terminating a crawling session. There might be cases where you want the crawler to halt upon encountering some types of exceptions. You can do so with CrawlerConfig.setStopOnExceptions(List).
The crawler fires all kinds of events to notify interested parties of such things as when a document is rejected, imported, committed, etc. You can listen to crawler events using CrawlerConfig.setEventListeners(List).
During and between crawl sessions, the crawler needs to preserve specific information in order to keep track of things such as the queue of document references to process, those already processed, whether a document has been modified since last crawled, caching of document checksums, etc. For this, the crawler uses a database we call a crawl data store engine. The default implementation uses the local file system to store these (see MVStoreDataStoreEngine). While very capable and suitable for most sites, if you need a larger storage system, you can provide your own implementation with CrawlerConfig.setDataStoreEngine(IDataStoreEngine).
The process of transforming, enhancing, parsing to extract plain text, and many other document-specific processing activities are handled by the Norconex Importer module. See ImporterConfig for many additional configuration options.
On a fresh crawl, documents that are unreachable or not obtained successfully for some reason are simply logged and ignored. On the other hand, documents that were successfully crawled once and are suddenly failing on a subsequent crawl are considered "spoiled". You can decide whether to grace (retry next time), delete, or ignore those spoiled documents with CrawlerConfig.setSpoiledReferenceStrategizer(ISpoiledReferenceStrategizer).
The last step of successfully processing a document is to store it in your preferred target repository (or repositories). For this to happen, you have to configure one or more Committers corresponding to your needs or create a custom one. You can have a look at available Committers here: https://opensource.norconex.com/committers/ See CrawlerConfig.setCommitters(List).
To crawl and parse a document, it needs to be downloaded first. This is the role of one or more HTTP Fetchers. GenericHttpFetcher is the default implementation and can handle most web sites. There might be cases where a more specialized way of obtaining web resources is needed. For instance, JavaScript-generated web pages are often best handled by web browsers. In such cases you can use the WebDriverHttpFetcher. You can also use setHttpFetchers(List) to supply your own fetcher implementation.
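For example, to use the default fetcher plus a browser-based one (a sketch; no-argument constructors are assumed and fetcher-specific configuration is omitted):

    cfg.setHttpFetchers(
            new GenericHttpFetcher(),
            new WebDriverHttpFetcher());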
A fetcher typically issues an HTTP GET request to obtain a document. There might be cases where you first want to issue a separate HEAD request. One example is to filter documents based on the HTTP HEAD response information, thus potentially avoiding the download of large files you don't want.
You can tell the crawler how it should handle HTTP GET and HEAD requests using setFetchHttpGet(HttpMethodSupport) and setFetchHttpHead(HttpMethodSupport) respectively. For each, the options are:
- DISABLED: no request is attempted using that HTTP method.
- OPTIONAL: if the method is unsupported by the fetcher or the request fails, the document can still be processed by the other method.
- REQUIRED: if the method is unsupported by the fetcher or the request fails, the document is rejected.
If you enable only one HTTP method (the default), specifying OPTIONAL or REQUIRED for it has the same effect. At least one method needs to be enabled for an HTTP request to be attempted. By default, HEAD requests are DISABLED and GET requests are REQUIRED. If you are unsure what settings to use, keep the defaults.
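For example, to issue an optional HEAD request before the required GET (methods and enum values as listed on this page):

    cfg.setFetchHttpHead(HttpCrawlerConfig.HttpMethodSupport.OPTIONAL);
    cfg.setFetchHttpGet(HttpCrawlerConfig.HttpMethodSupport.REQUIRED);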
Without filtering, you would typically crawl many documents you are not interested in. There are different types of filtering offered to you, occurring at different times during the URL crawling process. The sooner in the URL processing life-cycle you filter out a document, the more you can improve crawler performance. It may be important for you to understand the differences:
Metadata filters: Apply filtering on a document's metadata fields. If isFetchHttpHead() returns true, these filters will be invoked after the crawler performs a distinct HTTP HEAD request. This gives you the opportunity to filter documents based on the HTTP HEAD response, potentially saving a more expensive HTTP GET request for download (but resulting in two HTTP requests for valid documents: HEAD and GET). Filtering occurs before URLs are extracted. When isFetchHttpHead() is false, these filters will be invoked on the metadata of the HTTP response obtained from an HTTP GET request (as the document is downloaded). Filtering occurs after URLs are extracted.
By default, the crawler tries to respect instructions a web site has put in place for the benefit of crawlers. Here is a list of some of the popular ones that can be turned off or for which you can supply your own implementation:
- Robot rules, defined in a "robots.txt" file or via the X-Robots-Tag response header and robots meta tags. See: setIgnoreRobotsTxt(boolean), setRobotsTxtProvider(IRobotsTxtProvider), setIgnoreRobotsMeta(boolean), setRobotsMetaProvider(IRobotsMetaProvider).
- The rel="nofollow" attribute set on HTML links. See: HtmlLinkExtractor.setIgnoreNofollow(boolean).
- Sitemap detection and resolution. See: setIgnoreSitemap(boolean).
- Canonical URL pointers, defined in a <meta ...> tag or via HTTP response instructions. To crawl non-canonical pages, use setIgnoreCanonicalLinks(boolean).
- If-Modified-Since: the default HTTP fetcher (GenericHttpFetcher) uses the If-Modified-Since feature as part of its HTTP requests for web sites supporting it (only affects incremental crawls). To turn that off, use GenericHttpFetcherConfig.setDisableIfModifiedSince(boolean).
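For example, to opt out of several of these instructions at once (all are respected by default; setters as listed on this page):

    cfg.setIgnoreRobotsTxt(true);
    cfg.setIgnoreRobotsMeta(true);
    cfg.setIgnoreSitemap(true);
    cfg.setIgnoreCanonicalLinks(true);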
The crawler will crawl any given URL at most one time per crawling session. It is possible to skip documents that are not yet "ready" to be re-crawled, to speed up each crawling session. Sitemap.xml directives to that effect are respected by default ("frequency" and "lastmod"). You can supply your own conditions for re-crawl with setRecrawlableResolver(IRecrawlableResolver). This feature can be used, for instance, to crawl a "news" section of your site more frequently than, say, an "archive" section of your site.
To find out whether a document has changed from one crawling session to another, the crawler creates and keeps a digital signature, or checksum, of each crawled document. Upon crawling the same URL again, a new checksum is created and compared against the previous one. Any difference indicates a modified document. There are two checksums at play, tested at different times: one obtained from a document's metadata (default is LastModifiedMetadataChecksummer) and one from the document itself (default is MD5DocumentChecksummer). You can provide your own implementations. See CrawlerConfig.setMetadataChecksummer(IMetadataChecksummer) and CrawlerConfig.setDocumentChecksummer(IDocumentChecksummer).
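For example, to set the two checksummers explicitly to their documented defaults (a sketch; no-argument constructors are assumed):

    cfg.setMetadataChecksummer(new LastModifiedMetadataChecksummer());
    cfg.setDocumentChecksummer(new MD5DocumentChecksummer());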
EXPERIMENTAL: The crawler can attempt to detect and reject documents considered duplicates within a crawler session. A document will be considered a duplicate if a previously processed document has the same metadata or document checksum. To enable this feature, set CrawlerConfig.setMetadataDeduplicate(boolean) and/or CrawlerConfig.setDocumentDeduplicate(boolean) to true. These settings have no effect if the corresponding checksummers are not set (null).
Deduplication can impact crawl performance. It is recommended you use it only if you can't distinguish duplicates via other means (URL normalizer, canonical URL support, etc.). Also, you should only enable this feature if you know your checksummer(s) will generate a checksum that is acceptably unique to you.
To be able to crawl a web site, links need to be extracted from web pages. This is the job of a link extractor. It is possible to use multiple link extractors for different types of content. By default, the HtmlLinkExtractor is used, but you can add others or provide your own with setLinkExtractors(List).
There might be cases where you want a document to be parsed by the Importer and establish which links to process yourself during the importing phase (for more advanced use cases). In such cases, you can identify a document metadata field to use as a URL holding tank after importing has occurred. URLs in that field will become eligible for crawling. See setPostImportLinks(TextMatcher).
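For example, to declare the default link extractor explicitly and also crawl URLs placed in a metadata field by your Importer handlers (a sketch; "myUrlsField" is a hypothetical field name, and TextMatcher.basic is assumed from the Norconex Commons Lang API):

    cfg.setLinkExtractors(new HtmlLinkExtractor());
    cfg.setPostImportLinks(TextMatcher.basic("myUrlsField"));
    cfg.setPostImportLinksKeep(true);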
<crawler
id="(crawler unique identifier)">
<startURLs
stayOnDomain="[false|true]"
includeSubdomains="[false|true]"
stayOnPort="[false|true]"
stayOnProtocol="[false|true]"
async="[false|true]">
<!-- All the following tags are repeatable. -->
<url>(a URL)</url>
<urlsFile>(local path to a file containing URLs)</urlsFile>
<sitemap>(URL to a sitemap XML)</sitemap>
<provider
class="(IStartURLsProvider implementation)"/>
</startURLs>
<urlNormalizer
class="(IURLNormalizer implementation)"/>
<delay
class="(IDelayResolver implementation)"/>
<maxDepth>(maximum crawl depth)</maxDepth>
<keepDownloads>[false|true]</keepDownloads>
<keepReferencedLinks>[INSCOPE|OUTSCOPE|MAXDEPTH]</keepReferencedLinks>
<numThreads>(maximum number of threads)</numThreads>
<maxDocuments>(maximum number of documents to crawl)</maxDocuments>
<orphansStrategy>[PROCESS|IGNORE|DELETE]</orphansStrategy>
<stopOnExceptions>
<!-- Repeatable -->
<exception>(fully qualified class name of an exception)</exception>
</stopOnExceptions>
<eventListeners>
<!-- Repeatable -->
<listener
class="(IEventListener implementation)"/>
</eventListeners>
<dataStoreEngine
class="(IDataStoreEngine implementation)"/>
<fetchHttpHead>[DISABLED|REQUIRED|OPTIONAL]</fetchHttpHead>
<fetchHttpGet>[REQUIRED|DISABLED|OPTIONAL]</fetchHttpGet>
<httpFetchers
maxRetries="(number of times to retry a failed fetch attempt)"
retryDelay="(how many milliseconds to wait between re-attempting)">
<!-- Repeatable -->
<fetcher
class="(IHttpFetcher implementation)"/>
</httpFetchers>
<referenceFilters>
<!-- Repeatable -->
<filter
class="(IReferenceFilter implementation)"
onMatch="[include|exclude]"/>
</referenceFilters>
<robotsTxt
ignore="[false|true]"
class="(IRobotsMetaProvider implementation)"/>
<sitemapResolver
ignore="[false|true]"
class="(ISitemapResolver implementation)"/>
<recrawlableResolver
class="(IRecrawlableResolver implementation)"/>
<canonicalLinkDetector
ignore="[false|true]"
class="(ICanonicalLinkDetector implementation)"/>
<metadataChecksummer
class="(IMetadataChecksummer implementation)"/>
<metadataDeduplicate>[false|true]</metadataDeduplicate>
<robotsMeta
ignore="[false|true]"
class="(IRobotsMetaProvider implementation)"/>
<linkExtractors>
<!-- Repeatable -->
<extractor
class="(ILinkExtractor implementation)"/>
</linkExtractors>
<metadataFilters>
<!-- Repeatable -->
<filter
class="(IMetadataFilter implementation)"
onMatch="[include|exclude]"/>
</metadataFilters>
<documentFilters>
<!-- Repeatable -->
<filter
class="(IDocumentFilter implementation)"/>
</documentFilters>
<preImportProcessors>
<!-- Repeatable -->
<processor
class="(IHttpDocumentProcessor implementation)"/>
</preImportProcessors>
<importer>
<preParseHandlers>
<!-- Repeatable -->
<handler
class="(an handler class from the Importer module)"/>
</preParseHandlers>
<documentParserFactory
class="(IDocumentParser implementation)"/>
<postParseHandlers>
<!-- Repeatable -->
<handler
class="(an handler class from the Importer module)"/>
</postParseHandlers>
<responseProcessors>
<!-- Repeatable -->
<responseProcessor
class="(IImporterResponseProcessor implementation)"/>
</responseProcessors>
</importer>
<documentChecksummer
class="(IDocumentChecksummer implementation)"/>
<documentDeduplicate>[false|true]</documentDeduplicate>
<postImportProcessors>
<!-- Repeatable -->
<processor
class="(IHttpDocumentProcessor implementation)"/>
</postImportProcessors>
<postImportLinks
keep="[false|true]">
<fieldMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]"/>
</postImportLinks>
<spoiledReferenceStrategizer
class="(ISpoiledReferenceStrategizer implementation)"/>
<committers>
<committer
class="(ICommitter implementation)"/>
</committers>
</crawler>
Modifier and Type | Class and Description
---|---
static class | HttpCrawlerConfig.HttpMethodSupport
static class | HttpCrawlerConfig.ReferencedLinkType

Nested classes/interfaces inherited from class CrawlerConfig: CrawlerConfig.OrphansStrategy

Constructor and Description
---
HttpCrawlerConfig()
Modifier and Type | Method and Description
---|---
boolean | equals(Object other)
ICanonicalLinkDetector | getCanonicalLinkDetector() Gets the canonical link detector.
IDelayResolver | getDelayResolver()
HttpCrawlerConfig.HttpMethodSupport | getFetchHttpGet() Gets whether to fetch HTTP documents using an HTTP GET request.
HttpCrawlerConfig.HttpMethodSupport | getFetchHttpHead() Gets whether to fetch HTTP response headers using an HTTP HEAD request.
List<IHttpFetcher> | getHttpFetchers() Gets HTTP fetchers.
int | getHttpFetchersMaxRetries() Gets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures.
long | getHttpFetchersRetryDelay() Gets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds).
Set<HttpCrawlerConfig.ReferencedLinkType> | getKeepReferencedLinks() Gets what type of referenced links to keep, if any.
List<ILinkExtractor> | getLinkExtractors() Gets link extractors.
int | getMaxDepth()
TextMatcher | getPostImportLinks() Gets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
List<IHttpDocumentProcessor> | getPostImportProcessors() Gets post-import processors.
List<IHttpDocumentProcessor> | getPreImportProcessors() Gets pre-import processors.
IRecrawlableResolver | getRecrawlableResolver() Gets the recrawlable resolver.
IRobotsMetaProvider | getRobotsMetaProvider()
IRobotsTxtProvider | getRobotsTxtProvider()
ISitemapResolver | getSitemapResolver()
List<String> | getStartSitemapURLs() Gets sitemap URLs to be used as starting points for crawling.
List<String> | getStartURLs() Gets URLs to initiate crawling from.
List<Path> | getStartURLsFiles() Gets the file paths of seed files containing URLs to be used as "start URLs".
List<IStartURLsProvider> | getStartURLsProviders() Gets the providers of URLs used as starting points for crawling.
URLCrawlScopeStrategy | getURLCrawlScopeStrategy() Gets the strategy to use to determine if a URL is in scope.
IURLNormalizer | getUrlNormalizer()
int | hashCode()
boolean | isFetchHttpHead() Deprecated. Use getFetchHttpHead().
boolean | isIgnoreCanonicalLinks() Whether canonical links found in HTTP headers and in HTML files <head> section should be ignored or processed.
boolean | isIgnoreRobotsMeta()
boolean | isIgnoreRobotsTxt()
boolean | isIgnoreSitemap() Whether to ignore sitemap detection and resolving for URLs processed.
boolean | isKeepDownloads()
boolean | isKeepOutOfScopeLinks() Deprecated. Since 3.0.0, use getKeepReferencedLinks().
boolean | isPostImportLinksKeep() Gets whether to keep the importer-generated field holding URLs to consider for crawling.
boolean | isStartURLsAsync() Gets whether the start URLs should be loaded asynchronously.
protected void | loadCrawlerConfigFromXML(XML xml)
protected void | saveCrawlerConfigToXML(XML xml)
void | setCanonicalLinkDetector(ICanonicalLinkDetector canonicalLinkDetector) Sets the canonical link detector.
void | setDelayResolver(IDelayResolver delayResolver)
void | setFetchHttpGet(HttpCrawlerConfig.HttpMethodSupport fetchHttpGet) Sets whether to fetch HTTP documents using an HTTP GET request.
void | setFetchHttpHead(boolean fetchHttpHead) Deprecated.
void | setFetchHttpHead(HttpCrawlerConfig.HttpMethodSupport fetchHttpHead) Sets whether to fetch HTTP response headers using an HTTP HEAD request.
void | setHttpFetchers(IHttpFetcher... httpFetchers) Sets HTTP fetchers.
void | setHttpFetchers(List<IHttpFetcher> httpFetchers) Sets HTTP fetchers.
void | setHttpFetchersMaxRetries(int httpFetchersMaxRetries) Sets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures.
void | setHttpFetchersRetryDelay(long httpFetchersRetryDelay) Sets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds).
void | setIgnoreCanonicalLinks(boolean ignoreCanonicalLinks) Sets whether canonical links found in HTTP headers and in HTML files <head> section should be ignored or processed.
void | setIgnoreRobotsMeta(boolean ignoreRobotsMeta)
void | setIgnoreRobotsTxt(boolean ignoreRobotsTxt)
void | setIgnoreSitemap(boolean ignoreSitemap) Sets whether to ignore sitemap detection and resolving for URLs processed.
void | setKeepDownloads(boolean keepDownloads)
void | setKeepOutOfScopeLinks(boolean keepOutOfScopeLinks) Deprecated. Since 3.0.0, use setKeepReferencedLinks(Set).
void | setKeepReferencedLinks(HttpCrawlerConfig.ReferencedLinkType... keepReferencedLinks) Sets whether to keep referenced links and what to keep.
void | setKeepReferencedLinks(Set<HttpCrawlerConfig.ReferencedLinkType> keepReferencedLinks) Sets whether to keep referenced links and what to keep.
void | setLinkExtractors(ILinkExtractor... linkExtractors) Sets link extractors.
void | setLinkExtractors(List<ILinkExtractor> linkExtractors) Sets link extractors.
void | setMaxDepth(int depth)
void | setPostImportLinks(TextMatcher fieldMatcher) Sets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
void | setPostImportLinksKeep(boolean postImportLinksKeep) Sets whether to keep the importer-generated field holding URLs to consider for crawling.
void | setPostImportProcessors(IHttpDocumentProcessor... postImportProcessors) Sets post-import processors.
void | setPostImportProcessors(List<IHttpDocumentProcessor> postImportProcessors) Sets post-import processors.
void | setPreImportProcessors(IHttpDocumentProcessor... preImportProcessors) Sets pre-import processors.
void | setPreImportProcessors(List<IHttpDocumentProcessor> preImportProcessors) Sets pre-import processors.
void | setRecrawlableResolver(IRecrawlableResolver recrawlableResolver) Sets the recrawlable resolver.
void | setRobotsMetaProvider(IRobotsMetaProvider robotsMetaProvider)
void | setRobotsTxtProvider(IRobotsTxtProvider robotsTxtProvider)
void | setSitemapResolver(ISitemapResolver sitemapResolver)
void | setStartSitemapURLs(List<String> startSitemapURLs) Sets the sitemap URLs used as starting points for crawling.
void | setStartSitemapURLs(String... startSitemapURLs) Sets the sitemap URLs used as starting points for crawling.
void | setStartURLs(List<String> startURLs) Sets URLs to initiate crawling from.
void | setStartURLs(String... startURLs) Sets URLs to initiate crawling from.
void | setStartURLsAsync(boolean asyncStartURLs) Sets whether the start URLs should be loaded asynchronously.
void | setStartURLsFiles(List<Path> startURLsFiles) Sets the file paths of seed files containing URLs to be used as "start URLs".
void | setStartURLsFiles(Path... startURLsFiles) Sets the file paths of seed files containing URLs to be used as "start URLs".
void | setStartURLsProviders(IStartURLsProvider... startURLsProviders) Sets the providers of URLs used as starting points for crawling.
void | setStartURLsProviders(List<IStartURLsProvider> startURLsProviders) Sets the providers of URLs used as starting points for crawling.
void | setUrlCrawlScopeStrategy(URLCrawlScopeStrategy urlCrawlScopeStrategy) Sets the strategy to use to determine if a URL is in scope.
void | setUrlNormalizer(IURLNormalizer urlNormalizer)
String | toString()
Methods inherited from class CrawlerConfig: addEventListeners, addEventListeners, clearEventListeners, getCommitter, getCommitters, getDataStoreEngine, getDocumentChecksummer, getDocumentFilters, getEventListeners, getId, getImporterConfig, getMaxDocuments, getMetadataChecksummer, getMetadataFilters, getNumThreads, getOrphansStrategy, getReferenceFilters, getSpoiledReferenceStrategizer, getStopOnExceptions, isDocumentDeduplicate, isMetadataDeduplicate, loadFromXML, saveToXML, setCommitter, setCommitters, setCommitters, setDataStoreEngine, setDocumentChecksummer, setDocumentDeduplicate, setDocumentFilters, setDocumentFilters, setEventListeners, setEventListeners, setId, setImporterConfig, setMaxDocuments, setMetadataChecksummer, setMetadataDeduplicate, setMetadataFilters, setMetadataFilters, setNumThreads, setOrphansStrategy, setReferenceFilters, setReferenceFilters, setSpoiledReferenceStrategizer, setStopOnExceptions, setStopOnExceptions
@Deprecated public boolean isFetchHttpHead()
Deprecated. Use getFetchHttpHead().
Returns: true if fetching HTTP response headers separately

@Deprecated public void setFetchHttpHead(boolean fetchHttpHead)
Deprecated. Use setFetchHttpHead(HttpMethodSupport).
Parameters: fetchHttpHead - true if fetching HTTP response headers separately

public HttpCrawlerConfig.HttpMethodSupport getFetchHttpHead()
Gets whether to fetch HTTP response headers using an HTTP HEAD request. That HTTP request is performed separately from a document download request (HTTP "GET"). Useful when you need to filter documents based on HTTP header values, without downloading them first (e.g., to save bandwidth). When dealing with small documents on average, it may be best to avoid issuing two requests when a single one could do it. HttpCrawlerConfig.HttpMethodSupport.DISABLED by default.
See class documentation for more details.

public void setFetchHttpHead(HttpCrawlerConfig.HttpMethodSupport fetchHttpHead)
Sets whether to fetch HTTP response headers using an HTTP HEAD request.
See class documentation for more details.
Parameters: fetchHttpHead - HTTP HEAD method support

public HttpCrawlerConfig.HttpMethodSupport getFetchHttpGet()
Gets whether to fetch HTTP documents using an HTTP GET request. Requests made using the HTTP GET method are usually required to download a document and have its content extracted and links discovered. It should never be disabled unless you have an exceptional use case. HttpCrawlerConfig.HttpMethodSupport.REQUIRED by default.
See class documentation for more details.
Returns: true if fetching HTTP response headers separately

public void setFetchHttpGet(HttpCrawlerConfig.HttpMethodSupport fetchHttpGet)
Sets whether to fetch HTTP documents using an HTTP GET request. Requests made using the HTTP GET method are usually required to download a document and have its content extracted and links discovered. It should never be disabled unless you have an exceptional use case.
See class documentation for more details.
Parameters: fetchHttpGet - true if fetching HTTP response headers separately

public List<String> getStartURLs()
Returns: start URLs (never null)

public void setStartURLs(String... startURLs)
Parameters: startURLs - start URLs

public void setStartURLs(List<String> startURLs)
Parameters: startURLs - start URLs

public List<Path> getStartURLsFiles()
Returns: file paths of seed files containing URLs (never null)

public void setStartURLsFiles(Path... startURLsFiles)
Parameters: startURLsFiles - file paths of seed files containing URLs

public void setStartURLsFiles(List<Path> startURLsFiles)
Parameters: startURLsFiles - file paths of seed files containing URLs

public List<String> getStartSitemapURLs()
Returns: sitemap URLs (never null)

public void setStartSitemapURLs(String... startSitemapURLs)
Parameters: startSitemapURLs - sitemap URLs

public void setStartSitemapURLs(List<String> startSitemapURLs)
Parameters: startSitemapURLs - sitemap URLs

public List<IStartURLsProvider> getStartURLsProviders()
Returns: start URL providers (never null)

public void setStartURLsProviders(IStartURLsProvider... startURLsProviders)
Parameters: startURLsProviders - start URL provider

public void setStartURLsProviders(List<IStartURLsProvider> startURLsProviders)
Parameters: startURLsProviders - start URL provider

public boolean isStartURLsAsync()
Gets whether the start URLs should be loaded asynchronously. When true, the crawler will start processing URLs in the queue even if start URLs are still being loaded. While this may speed up crawling, it may have an unexpected effect on the accuracy of HttpDocMetadata.DEPTH. Use of this option is only recommended when start URLs take a significant time to load (e.g., large sitemaps).
Returns: true if async.

public void setStartURLsAsync(boolean asyncStartURLs)
Sets whether the start URLs should be loaded asynchronously. When true, the crawler will start processing URLs in the queue even if start URLs are still being loaded. While this may speed up crawling, it may have an unexpected effect on the accuracy of HttpDocMetadata.DEPTH. Use of this option is only recommended when start URLs take a significant time to load (e.g., large sitemaps).
Parameters: asyncStartURLs - true if async.

public void setMaxDepth(int depth)

public int getMaxDepth()

public List<IHttpFetcher> getHttpFetchers()
Returns: list of HTTP fetchers (never null)

public void setHttpFetchers(IHttpFetcher... httpFetchers)
Parameters: httpFetchers - list of HTTP fetchers

public void setHttpFetchers(List<IHttpFetcher> httpFetchers)
Parameters: httpFetchers - list of HTTP fetchers

public int getHttpFetchersMaxRetries()

public void setHttpFetchersMaxRetries(int httpFetchersMaxRetries)
Parameters: httpFetchersMaxRetries - maximum number of retries

public long getHttpFetchersRetryDelay()

public void setHttpFetchersRetryDelay(long httpFetchersRetryDelay)
Parameters: httpFetchersRetryDelay - retry delay

public ICanonicalLinkDetector getCanonicalLinkDetector()
Returns: the canonical link detector, or null if none are defined.

public void setCanonicalLinkDetector(ICanonicalLinkDetector canonicalLinkDetector)
Sets the canonical link detector. To disable canonical link detection, pass a null argument, or invoke setIgnoreCanonicalLinks(boolean) with a true value.
Parameters: canonicalLinkDetector - the canonical link detector

public List<ILinkExtractor> getLinkExtractors()

public void setLinkExtractors(ILinkExtractor... linkExtractors)
Parameters: linkExtractors - link extractors

public void setLinkExtractors(List<ILinkExtractor> linkExtractors)
Parameters: linkExtractors - link extractors

public IRobotsTxtProvider getRobotsTxtProvider()

public void setRobotsTxtProvider(IRobotsTxtProvider robotsTxtProvider)

public IURLNormalizer getUrlNormalizer()

public void setUrlNormalizer(IURLNormalizer urlNormalizer)

public IDelayResolver getDelayResolver()

public void setDelayResolver(IDelayResolver delayResolver)

public List<IHttpDocumentProcessor> getPreImportProcessors()

public void setPreImportProcessors(IHttpDocumentProcessor... preImportProcessors)
Parameters: preImportProcessors - pre-import processors

public void setPreImportProcessors(List<IHttpDocumentProcessor> preImportProcessors)
Parameters: preImportProcessors - pre-import processors

public List<IHttpDocumentProcessor> getPostImportProcessors()

public void setPostImportProcessors(IHttpDocumentProcessor... postImportProcessors)
Parameters: postImportProcessors - post-import processors

public void setPostImportProcessors(List<IHttpDocumentProcessor> postImportProcessors)
Parameters: postImportProcessors - post-import processors

public boolean isIgnoreRobotsTxt()

public void setIgnoreRobotsTxt(boolean ignoreRobotsTxt)

public boolean isKeepDownloads()

public void setKeepDownloads(boolean keepDownloads)

@Deprecated public boolean isKeepOutOfScopeLinks()
Deprecated. Since 3.0.0, use getKeepReferencedLinks().
See HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE.
Returns: true if keeping URLs not in scope.

@Deprecated public void setKeepOutOfScopeLinks(boolean keepOutOfScopeLinks)
Deprecated. Since 3.0.0, use setKeepReferencedLinks(Set).
See HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE.
Parameters: keepOutOfScopeLinks - true if keeping URLs not in scope

public Set<HttpCrawlerConfig.ReferencedLinkType> getKeepReferencedLinks()

public void setKeepReferencedLinks(Set<HttpCrawlerConfig.ReferencedLinkType> keepReferencedLinks)
Parameters: keepReferencedLinks - option for keeping links

public void setKeepReferencedLinks(HttpCrawlerConfig.ReferencedLinkType... keepReferencedLinks)
Parameters: keepReferencedLinks - option for keeping links

public boolean isIgnoreRobotsMeta()

public void setIgnoreRobotsMeta(boolean ignoreRobotsMeta)

public IRobotsMetaProvider getRobotsMetaProvider()

public void setRobotsMetaProvider(IRobotsMetaProvider robotsMetaProvider)

public boolean isIgnoreSitemap()
Whether to ignore sitemap detection and resolving for URLs processed. Sitemaps provided as start URLs (getStartSitemapURLs()) are never ignored.
Returns: true to ignore sitemaps

public void setIgnoreSitemap(boolean ignoreSitemap)
Sets whether to ignore sitemap detection and resolving for URLs processed. Sitemaps provided as start URLs (getStartSitemapURLs()) are never ignored.
Parameters: ignoreSitemap - true to ignore sitemaps

public ISitemapResolver getSitemapResolver()

public void setSitemapResolver(ISitemapResolver sitemapResolver)

public boolean isIgnoreCanonicalLinks()
Whether canonical links found in HTTP headers and in HTML files <head> section should be ignored or processed.
Returns: true if ignoring canonical links

public void setIgnoreCanonicalLinks(boolean ignoreCanonicalLinks)
Sets whether canonical links found in HTTP headers and in HTML files <head> section should be ignored or processed.
Parameters: ignoreCanonicalLinks - true if ignoring canonical links

public URLCrawlScopeStrategy getURLCrawlScopeStrategy()

public void setUrlCrawlScopeStrategy(URLCrawlScopeStrategy urlCrawlScopeStrategy)
Parameters: urlCrawlScopeStrategy - strategy to use

public IRecrawlableResolver getRecrawlableResolver()

public void setRecrawlableResolver(IRecrawlableResolver recrawlableResolver)
Parameters: recrawlableResolver - the recrawlable resolver

public TextMatcher getPostImportLinks()

public void setPostImportLinks(TextMatcher fieldMatcher)
Parameters: fieldMatcher - field matcher

public boolean isPostImportLinksKeep()
Returns: true if keeping

public void setPostImportLinksKeep(boolean postImportLinksKeep)
Parameters: postImportLinksKeep - true if keeping

protected void saveCrawlerConfigToXML(XML xml)
Overrides: saveCrawlerConfigToXML in class CrawlerConfig

protected void loadCrawlerConfigFromXML(XML xml)
Overrides: loadCrawlerConfigFromXML in class CrawlerConfig

public boolean equals(Object other)
Overrides: equals in class CrawlerConfig

public int hashCode()
Overrides: hashCode in class CrawlerConfig

public String toString()
Overrides: toString in class CrawlerConfig
Copyright © 2009–2023 Norconex Inc. All rights reserved.