Class HttpCrawlerConfig
- All Implemented Interfaces:
IXMLConfigurable
HTTP Crawler configuration.
Start URLs
Crawling begins with one or more "start" URLs. Multiple start URLs can be defined, in a combination of ways:
- url: A start URL directly in the configuration (see setStartURLs(List)).
- urlsFile: A path to a file containing a list of start URLs, one per line (see setStartURLsFiles(List)).
- sitemap: A URL pointing to a sitemap XML file that contains the URLs to crawl (see setStartSitemapURLs(List)).
- provider: Your own class implementing IStartURLsProvider to dynamically provide a list of start URLs (see setStartURLsProviders(List)).
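For example, several of these can be combined in Java, as in this minimal sketch (import statements are omitted throughout these sketches, and all URLs and the file path are purely illustrative):

  HttpCrawlerConfig cfg = new HttpCrawlerConfig();
  // Start URLs given directly:
  cfg.setStartURLs("https://example.com/", "https://example.com/products/");
  // A seed file with one URL per line (the path is hypothetical):
  cfg.setStartURLsFiles(Paths.get("/path/to/start-urls.txt"));
  // A sitemap used as a starting point:
  cfg.setStartSitemapURLs("https://example.com/sitemap.xml");

The same cfg instance is reused by the other sketches on this page.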
Scope: To limit crawling to specific web domains and avoid creating many filters to that effect, you can tell the crawler to "stay" within the web site "scope" with setUrlCrawlScopeStrategy(URLCrawlScopeStrategy).
URL Normalization
Pages on web sites are often referenced using different URL
patterns. Such URL variations can fool the crawler into downloading the
same document multiple times. To avoid this, URLs are "normalized". That is,
they are converted so they are always formulated the same way.
By default, the crawler only applies normalization in ways that are
semantically equivalent (see GenericURLNormalizer).
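As a sketch, the default can also be set explicitly (assuming GenericURLNormalizer offers a no-argument constructor applying its default rules):

  // Explicitly use the default, semantically-safe normalizer:
  cfg.setUrlNormalizers(List.of(new GenericURLNormalizer()));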
Crawl Speed
Be kind to web sites you crawl. Being too aggressive can be perceived as a cyber-attack by the targeted web site (e.g., DoS attack). This can lead to your crawler being blocked.
For this reason, the crawler plays nice by default. It will wait a
few seconds between each page download, regardless of the maximum
number of threads specified or whether pages crawled are on different
web sites. This can of course be changed to be as fast as you want.
See GenericDelayResolver for changing the default options. You can also
provide your own "delay resolver" by supplying a class implementing
IDelayResolver.
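A minimal sketch adjusting the default delay (GenericDelayResolver's no-argument constructor and its setDefaultDelay(long) setter are assumptions here; verify them against that class's documentation):

  GenericDelayResolver delay = new GenericDelayResolver();
  delay.setDefaultDelay(5000); // assumed setter: 5 seconds between downloads
  cfg.setDelayResolver(delay);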
Crawl Depth
The crawl depth represents how many levels from the start URL the crawler
goes. From a browser user perspective, it can be seen as the number of
link "clicks" required from a start URL in order to get to a specific page.
The crawler will keep crawling as deep as it can for as long as it discovers
new URLs that are not rejected by your configuration. This is not always desirable.
For instance, a web site could have dynamically generated URLs with infinite
possibilities (e.g., dynamically generated web calendars). To avoid
infinite crawls, it is recommended to limit the maximum depth to something
reasonable for your site with setMaxDepth(int).
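For example:

  // Do not follow links more than 10 "clicks" away from a start URL:
  cfg.setMaxDepth(10);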
Keeping downloaded files
Downloaded files are deleted after being processed. Use
setKeepDownloads(boolean) with a true value to preserve
them. Files will be kept under a new "downloads" folder found under
your working directory. Keep in mind this is not a method for cloning a
site. Use with caution on large sites as it can quickly
fill up the local disk space.
Keeping Referenced Links
By default the crawler stores, as metadata, URLs extracted from
documents that are in scope. Exceptions
are pages discovered at the configured maximum depth
(setMaxDepth(int)).
This can be changed using the
setKeepReferencedLinks(Set) method.
Changing this setting has no effect on which pages get crawled.
Possible options are:
- INSCOPE: Default. Store "in-scope" links as HttpDocMetadata.REFERENCED_URLS.
- OUTSCOPE: Store "out-of-scope" links as HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE.
- MAXDEPTH: Also store links extracted on pages at the maximum depth. Must be used with at least one other option to have any effect.
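For example (ReferencedLinkType is the enum nested in HttpCrawlerConfig):

  // Keep both in-scope and out-of-scope extracted URLs as metadata:
  cfg.setKeepReferencedLinks(
          HttpCrawlerConfig.ReferencedLinkType.INSCOPE,
          HttpCrawlerConfig.ReferencedLinkType.OUTSCOPE);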
Orphan documents
Orphans are valid documents that, on subsequent crawls, can no longer be
reached (e.g., they are no longer referenced). This is regardless of
whether the file has been deleted at the source or not.
You can tell the crawler how to handle those with
CrawlerConfig.setOrphansStrategy(OrphansStrategy). Possible options are:
- PROCESS: Default. Tries to crawl orphans normally, as if they were still reachable by the crawler.
- IGNORE: Does nothing with orphans (neither deleted nor processed).
- DELETE: Orphans are sent to your Committer for deletion.
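For example:

  // Send no-longer-reachable documents to the Committer for deletion:
  cfg.setOrphansStrategy(CrawlerConfig.OrphansStrategy.DELETE);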
Error Handling
By default the crawler logs exceptions while trying to prevent them
from terminating a crawling session. There might be cases where you want
the crawler to halt upon encountering some types of exceptions.
You can do so with CrawlerConfig.setStopOnExceptions(List).
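A sketch (assuming the list holds exception classes; IOException is used purely as an illustration):

  // Halt the session on I/O errors instead of just logging them:
  cfg.setStopOnExceptions(List.of(java.io.IOException.class));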
Crawler Events
The crawler fires all kinds of events to notify interested parties of such
things as when a document is rejected, imported, committed, etc.
You can listen to crawler events using CrawlerConfig.setEventListeners(List).
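A sketch of a trivial listener (the IEventListener interface name and its single-method, lambda-friendly shape are assumptions here; check the Norconex event API before relying on this):

  // Print each crawler event as it fires (listener shape assumed):
  IEventListener<Event> logAll = event -> System.out.println(event);
  cfg.addEventListeners(logAll);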
Data Store (Cache)
During and between crawl sessions, the crawler needs to preserve
specific information in order to keep track of
things such as the queue of document references to process,
those already processed, whether a document has been modified since it was
last crawled, cached document checksums, etc.
For this, the crawler uses a database we call a crawl data store engine.
The default implementation uses the local file system to store these
(see MVStoreDataStoreEngine). While very capable and suitable
for most sites, if you need a larger storage system, you can provide your
own implementation with CrawlerConfig.setDataStoreEngine(IDataStoreEngine).
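For example (MVStoreDataStoreEngine's no-argument constructor is an assumption):

  // Explicitly use the default local storage engine; substitute your own
  // IDataStoreEngine implementation for larger storage needs:
  cfg.setDataStoreEngine(new MVStoreDataStoreEngine());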
Document Importing
The process of transforming, enhancing, and parsing a document to extract
plain text, along with many other document-specific processing activities,
is handled by the Norconex Importer module. See ImporterConfig for many
additional configuration options.
Bad Documents
On a fresh crawl, documents that are unreachable or not obtained
successfully for some reason are simply logged and ignored.
On the other hand, documents that were successfully crawled once
and are suddenly failing on a subsequent crawl are considered "spoiled".
You can decide whether to give them a grace period (retry next time), delete, or ignore
those spoiled documents with
CrawlerConfig.setSpoiledReferenceStrategizer(ISpoiledReferenceStrategizer).
Committing Documents
The last step of a successful processing of a document is to
store it in your preferred target repository (or repositories).
For this to happen, you have to configure one or more Committers
corresponding to your needs or create a custom one.
You can have a look at available Committers here:
https://opensource.norconex.com/committers/
See CrawlerConfig.setCommitters(List).
HTTP Fetcher
To crawl and parse a document, it needs to be downloaded first. This is the
role of one or more HTTP Fetchers. GenericHttpFetcher is the
default implementation and can handle most web sites.
There might be cases where a more specialized way of obtaining web resources
is needed. For instance, JavaScript-generated web pages are often best
handled by web browsers. In such cases, you can use the
WebDriverHttpFetcher. You can also use
setHttpFetchers(List) to supply your own fetcher implementation.
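For example (GenericHttpFetcher's no-argument constructor is an assumption):

  // Use the default fetcher and retry failed fetches:
  cfg.setHttpFetchers(new GenericHttpFetcher());
  cfg.setHttpFetchersMaxRetries(2);    // up to 2 retries per resource
  cfg.setHttpFetchersRetryDelay(3000); // wait 3 seconds between attempts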
HTTP Methods
A fetcher typically issues an HTTP GET request to obtain a document. There might be cases where you first want to issue a separate HEAD request. One example is to filter documents based on the HTTP HEAD response information, thus possibly saving downloading large files you don't want.
You can tell the crawler how it should handle HTTP GET and HEAD requests
using setFetchHttpGet(HttpMethodSupport) and
setFetchHttpHead(HttpMethodSupport) respectively.
For each, the options are:
- DISABLED: No HTTP call will be made using that method.
- OPTIONAL: If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document can still be processed successfully by the other HTTP method. Only relevant when both HEAD and GET are enabled.
- REQUIRED: If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document will be rejected and won't go any further, even if the other HTTP method was or could have been successful. Only relevant when both HEAD and GET are enabled.
If you enable only one HTTP method (the default), then specifying OPTIONAL or REQUIRED for it has the same effect. At least one method needs to be enabled for an HTTP request to be attempted. By default, HEAD requests are DISABLED and GET is REQUIRED. If you are unsure what settings to use, keep the defaults.
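For example, to filter on HEAD responses when a fetcher supports them while still requiring the usual GET:

  cfg.setFetchHttpHead(HttpCrawlerConfig.HttpMethodSupport.OPTIONAL);
  cfg.setFetchHttpGet(HttpCrawlerConfig.HttpMethodSupport.REQUIRED);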
Filtering Unwanted Documents
Without filtering, you would typically crawl many documents you are not interested in. There are different types of filtering offered to you, occurring at different times during the URL processing life-cycle. The sooner in a URL's processing life-cycle you filter out a document, the better the crawler performance. It may be important for you to understand the differences:
- Reference filters: The fastest way to exclude a document. The filtering rule applies to the URL, before any HTTP request is made for that URL (see the sketch after this list). Rejected documents are not queued for processing. They are not downloaded (thus no URLs are extracted), and the specified "delay" between downloads is not applied to them.
- Metadata filters: Apply filtering on a document's metadata fields. If isFetchHttpHead() returns true, these filters will be invoked after the crawler performs a distinct HTTP HEAD request. This gives you the opportunity to filter documents based on the HTTP HEAD response, potentially saving a more expensive HTTP GET request for download (but resulting in two HTTP requests for valid documents: HEAD and GET). Filtering occurs before URLs are extracted. When isFetchHttpHead() is false, these filters will be invoked on the metadata of the HTTP response obtained from an HTTP GET request (as the document is downloaded), and filtering occurs after URLs are extracted.
- Document filters: Use when having access to the document itself (and its content) is required to apply filtering. Always triggered after a document is downloaded and after URLs are extracted, but before it is imported (Importer module).
- Importer filters: The Importer module also offers document filtering options. At that point a document is already downloaded and its links extracted. There are two types of filtering offered by the Importer: before and after document parsing. Use filters before parsing if you need to filter on raw content or want to prevent a more expensive parsing. Use filters after parsing when you need to read the content as plain text.
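As a sketch of the earliest (reference) filtering stage, assuming IReferenceFilter is a lambda-friendly interface receiving the URL string (verify the actual interface before relying on this):

  // Only queue URLs under /docs/; everything else is rejected before
  // any HTTP request is made (filter interface shape assumed):
  IReferenceFilter docsOnly = reference -> reference.contains("/docs/");
  cfg.setReferenceFilters(docsOnly);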
Robot Directives
By default, the crawler tries to respect instructions a web site has put in place for the benefit of crawlers. Here is a list of some of the popular ones; each can be turned off or replaced with your own implementation.
- Robot rules: Rules defined in a "robots.txt" file at the root of a web site, or via the X-Robots-Tag HTTP header. See: setIgnoreRobotsTxt(boolean), setRobotsTxtProvider(IRobotsTxtProvider), setIgnoreRobotsMeta(boolean), setRobotsMetaProvider(IRobotsMetaProvider).
- HTML "nofollow": Most HTML-oriented link extractors support the rel="nofollow" attribute set on HTML links. See: HtmlLinkExtractor.setIgnoreNofollow(boolean).
- Sitemap: Sitemap XML files are auto-detected and used to find a list of URLs to crawl. To disable detection, use setIgnoreSitemap(boolean).
- Canonical URLs: The crawler will reject URLs that are non-canonical, as per HTML <meta ...> or HTTP response instructions. To crawl non-canonical pages, use setIgnoreCanonicalLinks(boolean).
- If Modified Since: The default HTTP Fetcher (GenericHttpFetcher) uses the If-Modified-Since feature as part of its HTTP requests for web sites supporting it (only affects incremental crawls). To turn that off, use GenericHttpFetcherConfig.setDisableIfModifiedSince(boolean).
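For example, to opt out of several of these directives (use responsibly):

  cfg.setIgnoreRobotsTxt(true);      // ignore robots.txt rules
  cfg.setIgnoreRobotsMeta(true);     // ignore robots <meta> directives
  cfg.setIgnoreSitemap(true);        // disable sitemap auto-detection
  cfg.setIgnoreCanonicalLinks(true); // crawl non-canonical pages too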
Re-crawl Frequency
The crawler will crawl any given URL at most once per crawling session.
It is possible to skip documents that are not yet "ready" to be re-crawled,
to speed up each crawling session.
Sitemap.xml directives to that effect are respected by default
("frequency" and "lastmod"). You can have your own conditions for re-crawl
with setRecrawlableResolver(IRecrawlableResolver).
This feature can be used, for instance, to crawl a "news" section of your
site more frequently than, say, an "archive" section.
Change Detection (Checksums)
To find out if a document has changed from one crawling session to another,
the crawler creates and keeps a digital signature, or checksum, of each
crawled document. Upon crawling the same URL again, a new checksum
is created and compared against the previous one. Any difference indicates
a modified document. There are two checksums at play, tested at
different times: one obtained from
a document's metadata (default is LastModifiedMetadataChecksummer)
and one from the document itself (default is MD5DocumentChecksummer). You can
provide your own implementations. See:
CrawlerConfig.setMetadataChecksummer(IMetadataChecksummer) and
CrawlerConfig.setDocumentChecksummer(IDocumentChecksummer).
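For example, explicitly setting the documented defaults (both no-argument constructors are assumptions):

  cfg.setMetadataChecksummer(new LastModifiedMetadataChecksummer());
  cfg.setDocumentChecksummer(new MD5DocumentChecksummer());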
Deduplication
EXPERIMENTAL:
The crawler can attempt to detect and reject documents considered
duplicates within a crawler session. A document will be considered
a duplicate if a document with the same metadata or document checksum
was already processed. To enable this feature, set
CrawlerConfig.setMetadataDeduplicate(boolean) and/or
CrawlerConfig.setDocumentDeduplicate(boolean) to true. Setting
those will have no effect if the corresponding checksummers are
not set (null).
Deduplication can impact crawl performance. It is recommended you use it only if you can't distinguish duplicates via other means (URL normalizer, canonical URL support, etc.). Also, you should only enable this feature if you know your checksummer(s) will generate a checksum that is acceptably unique to you.
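For example:

  // EXPERIMENTAL: reject in-session duplicates based on checksums.
  // Has no effect if the corresponding checksummer is null.
  cfg.setMetadataDeduplicate(true);
  cfg.setDocumentDeduplicate(true);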
URL Extraction
To be able to crawl a web site, links need to be extracted from
web pages. That is the job of a link extractor. It is possible to use
multiple link extractors for different types of content. By default,
the HtmlLinkExtractor is used, but you can add others or
provide your own with setLinkExtractors(List).
There might be
cases where you want a document to be parsed by the Importer and to establish
which links to process yourself during the importing phase (for more
advanced use cases). In such cases, you can identify a document metadata
field to use as a URL holding tank after importing has occurred.
URLs in that field will become eligible for crawling.
See setPostImportLinks(TextMatcher).
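A sketch combining both (HtmlLinkExtractor's no-argument constructor and TextMatcher.basic(String) are assumptions, and the field name is hypothetical):

  // Extract links with the default HTML extractor:
  cfg.setLinkExtractors(new HtmlLinkExtractor());
  // Also queue URLs found in a metadata field populated at import time:
  cfg.setPostImportLinks(TextMatcher.basic("myCustomUrlField"));
  cfg.setPostImportLinksKeep(true);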
XML configuration usage:
<crawler
id="(crawler unique identifier)">
<startURLs
stayOnDomain="[false|true]"
includeSubdomains="[false|true]"
stayOnPort="[false|true]"
stayOnProtocol="[false|true]"
async="[false|true]">
<!-- All the following tags are repeatable. -->
<url>(a URL)</url>
<urlsFile>(local path to a file containing URLs)</urlsFile>
<sitemap>(URL to a sitemap XML)</sitemap>
<provider
class="(IStartURLsProvider implementation)"/>
</startURLs>
<urlNormalizers>
<urlNormalizer
class="(IURLNormalizer implementation)"/>
</urlNormalizers>
<delay
class="(IDelayResolver implementation)"/>
<maxDepth>(maximum crawl depth)</maxDepth>
<keepDownloads>[false|true]</keepDownloads>
<keepReferencedLinks>[INSCOPE|OUTSCOPE|MAXDEPTH]</keepReferencedLinks>
<fetchHttpHead>[DISABLED|REQUIRED|OPTIONAL]</fetchHttpHead>
<fetchHttpGet>[REQUIRED|DISABLED|OPTIONAL]</fetchHttpGet>
<httpFetchers
maxRetries="(number of times to retry a failed fetch attempt)"
retryDelay="(how many milliseconds to wait between re-attempting)">
<!-- Repeatable -->
<fetcher
class="(IHttpFetcher implementation)"/>
</httpFetchers>
<robotsTxt
ignore="[false|true]"
class="(IRobotsMetaProvider implementation)"/>
<sitemapResolver
ignore="[false|true]"
class="(ISitemapResolver implementation)"/>
<recrawlableResolver
class="(IRecrawlableResolver implementation)"/>
<canonicalLinkDetector
ignore="[false|true]"
class="(ICanonicalLinkDetector implementation)"/>
<robotsMeta
ignore="[false|true]"
class="(IRobotsMetaProvider implementation)"/>
<linkExtractors>
<!-- Repeatable -->
<extractor
class="(ILinkExtractor implementation)"/>
</linkExtractors>
<preImportProcessors>
<!-- Repeatable -->
<processor
class="(IHttpDocumentProcessor implementation)"/>
</preImportProcessors>
<postImportProcessors>
<!-- Repeatable -->
<processor
class="(IHttpDocumentProcessor implementation)"/>
</postImportProcessors>
<postImportLinks
keep="[false|true]">
<fieldMatcher/>
</postImportLinks>
</crawler>
Author:
- Pascal Essiembre
-
Nested Class Summary
Nested Classes:
- static enum HttpCrawlerConfig.HttpMethodSupport
- static enum HttpCrawlerConfig.ReferencedLinkType
Nested classes/interfaces inherited from class com.norconex.collector.core.crawler.CrawlerConfig:
CrawlerConfig.OrphansStrategy
Constructor Summary
Constructors:
- HttpCrawlerConfig()
Method Summary
- boolean equals(Object)
- getCanonicalLinkDetector(): Gets the canonical link detector.
- getDelayResolver()
- getFetchHttpGet(): Gets whether to fetch HTTP documents using an HTTP GET request.
- getFetchHttpHead(): Gets whether to fetch HTTP response headers using an HTTP HEAD request.
- getHttpFetchers(): Gets HTTP fetchers.
- int getHttpFetchersMaxRetries(): Gets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures.
- long getHttpFetchersRetryDelay(): Gets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds).
- getKeepReferencedLinks(): Gets what type of referenced links to keep, if any.
- getLinkExtractors(): Gets link extractors.
- int getMaxDepth()
- getPostImportLinks(): Gets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
- getPostImportProcessors(): Gets post-import processors.
- getPreImportProcessors(): Gets pre-import processors.
- getRecrawlableResolver(): Gets the recrawlable resolver.
- getRobotsMetaProvider()
- getRobotsTxtProvider()
- getSitemapResolver()
- getStartSitemapURLs(): Gets sitemap URLs to be used as starting points for crawling.
- getStartURLs(): Gets URLs to initiate crawling from.
- getStartURLsFiles(): Gets the file paths of seed files containing URLs to be used as "start URLs".
- getStartURLsProviders(): Gets the providers of URLs used as starting points for crawling.
- getURLCrawlScopeStrategy(): Gets the strategy to use to determine if a URL is in scope.
- getUrlNormalizer(): Deprecated, for removal: This API element is subject to removal in a future version.
- getUrlNormalizers(): Gets URL normalizers.
- int hashCode()
- boolean isFetchHttpHead(): Deprecated. Use getFetchHttpHead().
- boolean isIgnoreCanonicalLinks(): Whether canonical links found in HTTP headers and in HTML files <head> section should be ignored or processed.
- boolean isIgnoreRobotsMeta()
- boolean isIgnoreRobotsTxt()
- boolean isIgnoreSitemap(): Whether to ignore sitemap detection and resolving for URLs processed.
- boolean isKeepDownloads()
- boolean isKeepOutOfScopeLinks(): Deprecated. Since 3.0.0, use getKeepReferencedLinks().
- boolean isPostImportLinksKeep(): Gets whether to keep the importer-generated field holding URLs to consider for crawling.
- boolean isStartURLsAsync(): Gets whether the start URLs should be loaded asynchronously.
- protected void loadCrawlerConfigFromXML(...)
- protected void saveCrawlerConfigToXML(...)
- void setCanonicalLinkDetector(ICanonicalLinkDetector canonicalLinkDetector): Sets the canonical link detector.
- void setDelayResolver(IDelayResolver delayResolver)
- void setFetchHttpGet(HttpCrawlerConfig.HttpMethodSupport fetchHttpGet): Sets whether to fetch HTTP documents using an HTTP GET request.
- void setFetchHttpHead(boolean fetchHttpHead): Deprecated.
- void setFetchHttpHead(HttpCrawlerConfig.HttpMethodSupport fetchHttpHead): Sets whether to fetch HTTP response headers using an HTTP HEAD request.
- void setHttpFetchers(IHttpFetcher... httpFetchers): Sets HTTP fetchers.
- void setHttpFetchers(List<IHttpFetcher> httpFetchers): Sets HTTP fetchers.
- void setHttpFetchersMaxRetries(int httpFetchersMaxRetries): Sets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures.
- void setHttpFetchersRetryDelay(long httpFetchersRetryDelay): Sets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds).
- void setIgnoreCanonicalLinks(boolean ignoreCanonicalLinks): Sets whether canonical links found in HTTP headers and in HTML files <head> section should be ignored or processed.
- void setIgnoreRobotsMeta(boolean ignoreRobotsMeta)
- void setIgnoreRobotsTxt(boolean ignoreRobotsTxt)
- void setIgnoreSitemap(boolean ignoreSitemap): Sets whether to ignore sitemap detection and resolving for URLs processed.
- void setKeepDownloads(boolean keepDownloads)
- void setKeepOutOfScopeLinks(boolean keepOutOfScopeLinks): Deprecated. Since 3.0.0, use setKeepReferencedLinks(Set).
- void setKeepReferencedLinks(HttpCrawlerConfig.ReferencedLinkType... keepReferencedLinks): Sets whether to keep referenced links and what to keep.
- void setKeepReferencedLinks(Set<HttpCrawlerConfig.ReferencedLinkType> keepReferencedLinks): Sets whether to keep referenced links and what to keep.
- void setLinkExtractors(ILinkExtractor... linkExtractors): Sets link extractors.
- void setLinkExtractors(List<ILinkExtractor> linkExtractors): Sets link extractors.
- void setMaxDepth(int depth)
- void setPostImportLinks(TextMatcher fieldMatcher): Sets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
- void setPostImportLinksKeep(boolean postImportLinksKeep): Sets whether to keep the importer-generated field holding URLs to consider for crawling.
- void setPostImportProcessors(IHttpDocumentProcessor... postImportProcessors): Sets post-import processors.
- void setPostImportProcessors(List<IHttpDocumentProcessor> postImportProcessors): Sets post-import processors.
- void setPreImportProcessors(IHttpDocumentProcessor... preImportProcessors): Sets pre-import processors.
- void setPreImportProcessors(List<IHttpDocumentProcessor> preImportProcessors): Sets pre-import processors.
- void setRecrawlableResolver(IRecrawlableResolver recrawlableResolver): Sets the recrawlable resolver.
- void setRobotsMetaProvider(IRobotsMetaProvider robotsMetaProvider)
- void setRobotsTxtProvider(IRobotsTxtProvider robotsTxtProvider)
- void setSitemapResolver(ISitemapResolver sitemapResolver)
- void setStartSitemapURLs(String... startSitemapURLs): Sets the sitemap URLs used as starting points for crawling.
- void setStartSitemapURLs(List<String> startSitemapURLs): Sets the sitemap URLs used as starting points for crawling.
- void setStartURLs(String... startURLs): Sets URLs to initiate crawling from.
- void setStartURLs(List<String> startURLs): Sets URLs to initiate crawling from.
- void setStartURLsAsync(boolean asyncStartURLs): Sets whether the start URLs should be loaded asynchronously.
- void setStartURLsFiles(Path... startURLsFiles): Sets the file paths of seed files containing URLs to be used as "start URLs".
- void setStartURLsFiles(List<Path> startURLsFiles): Sets the file paths of seed files containing URLs to be used as "start URLs".
- void setStartURLsProviders(IStartURLsProvider... startURLsProviders): Sets the providers of URLs used as starting points for crawling.
- void setStartURLsProviders(List<IStartURLsProvider> startURLsProviders): Sets the providers of URLs used as starting points for crawling.
- void setUrlCrawlScopeStrategy(URLCrawlScopeStrategy urlCrawlScopeStrategy): Sets the strategy to use to determine if a URL is in scope.
- void setUrlNormalizer(IURLNormalizer urlNormalizer): Deprecated, for removal: Since 3.1.0, use setUrlNormalizers(List) instead.
- void setUrlNormalizers(List<IURLNormalizer> urlNormalizers): Sets URL normalizers.
- String toString()
Methods inherited from class com.norconex.collector.core.crawler.CrawlerConfig:
addEventListeners, addEventListeners, clearEventListeners, getCommitter, getCommitters, getDataStoreEngine, getDocumentChecksummer, getDocumentFilters, getEventListeners, getId, getImporterConfig, getMaxDocuments, getMetadataChecksummer, getMetadataFilters, getNumThreads, getOrphansStrategy, getReferenceFilters, getSpoiledReferenceStrategizer, getStopOnExceptions, isDocumentDeduplicate, isMetadataDeduplicate, loadFromXML, saveToXML, setCommitter, setCommitters, setCommitters, setDataStoreEngine, setDocumentChecksummer, setDocumentDeduplicate, setDocumentFilters, setDocumentFilters, setEventListeners, setEventListeners, setId, setImporterConfig, setMaxDocuments, setMetadataChecksummer, setMetadataDeduplicate, setMetadataFilters, setMetadataFilters, setNumThreads, setOrphansStrategy, setReferenceFilters, setReferenceFilters, setSpoiledReferenceStrategizer, setStopOnExceptions, setStopOnExceptions
-
Constructor Details
-
HttpCrawlerConfig
public HttpCrawlerConfig()
-
-
Method Details
-
isFetchHttpHead
Deprecated. Use getFetchHttpHead().
- Returns:
true if fetching HTTP response headers separately
- Since:
- 3.0.0-M1
-
setFetchHttpHead
Deprecated.
- Parameters:
fetchHttpHead - true if fetching HTTP response headers separately
- Since:
- 3.0.0-M1
-
getFetchHttpHead
Gets whether to fetch HTTP response headers using an HTTP HEAD request. That HTTP request is performed separately from a document download request (HTTP "GET"). Useful when you need to filter documents based on HTTP header values, without downloading them first (e.g., to save bandwidth). When dealing with small documents on average, it may be best to avoid issuing two requests when a single one could do it.
HttpCrawlerConfig.HttpMethodSupport.DISABLED by default. See class documentation for more details.
- Returns:
- HTTP HEAD method support
- Since:
- 3.0.0
-
setFetchHttpHead
Sets whether to fetch HTTP response headers using an HTTP HEAD request.
See class documentation for more details.
- Parameters:
fetchHttpHead - HTTP HEAD method support
- Since:
- 3.0.0
-
getFetchHttpGet
Gets whether to fetch HTTP documents using an HTTP GET request. Requests made using the HTTP GET method are usually required to download a document and have its content extracted and links discovered. It should never be disabled unless you have an exceptional use case.
HttpCrawlerConfig.HttpMethodSupport.REQUIRED by default. See class documentation for more details.
- Returns:
HTTP GET method support
- Since:
- 3.0.0
-
setFetchHttpGet
Sets whether to fetch HTTP documents using an HTTP GET request. Requests made using the HTTP GET method are usually required to download a document and have its content extracted and links discovered. It should never be disabled unless you have an exceptional use case.
See class documentation for more details.
- Parameters:
fetchHttpGet - HTTP GET method support
- Since:
- 3.0.0
-
getStartURLs
Gets URLs to initiate crawling from.
- Returns:
start URLs (never null)
-
setStartURLs
Sets URLs to initiate crawling from.
- Parameters:
startURLs - start URLs
-
setStartURLs
Sets URLs to initiate crawling from.
- Parameters:
startURLs - start URLs
- Since:
- 3.0.0
-
getStartURLsFiles
Gets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.
- Returns:
file paths of seed files containing URLs (never null)
- Since:
- 2.3.0
-
setStartURLsFiles
Sets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.
- Parameters:
startURLsFiles - file paths of seed files containing URLs
- Since:
- 2.3.0
-
setStartURLsFiles
Sets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.
- Parameters:
startURLsFiles - file paths of seed files containing URLs
- Since:
- 3.0.0
-
getStartSitemapURLs
Gets sitemap URLs to be used as starting points for crawling.
- Returns:
sitemap URLs (never null)
- Since:
- 2.3.0
-
setStartSitemapURLs
Sets the sitemap URLs used as starting points for crawling.
- Parameters:
startSitemapURLs - sitemap URLs
- Since:
- 2.3.0
-
setStartSitemapURLs
Sets the sitemap URLs used as starting points for crawling.
- Parameters:
startSitemapURLs - sitemap URLs
- Since:
- 3.0.0
-
getStartURLsProviders
Gets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
- Returns:
start URL providers (never null)
- Since:
- 2.7.0
-
setStartURLsProviders
Sets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
- Parameters:
startURLsProviders - start URL providers
- Since:
- 2.7.0
-
setStartURLsProviders
Sets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
- Parameters:
startURLsProviders - start URL providers
- Since:
- 3.0.0
-
isStartURLsAsync
public boolean isStartURLsAsync()
Gets whether the start URLs should be loaded asynchronously. When true, the crawler will start processing URLs in the queue even if start URLs are still being loaded. While this may speed up crawling, it may have an unexpected effect on the accuracy of HttpDocMetadata.DEPTH. Use of this option is only recommended when start URLs take a significant time to load (e.g., large sitemaps).
- Returns:
true if async
- Since:
- 3.0.0
-
setStartURLsAsync
public void setStartURLsAsync(boolean asyncStartURLs)
Sets whether the start URLs should be loaded asynchronously. When true, the crawler will start processing URLs in the queue even if start URLs are still being loaded. While this may speed up crawling, it may have an unexpected effect on the accuracy of HttpDocMetadata.DEPTH. Use of this option is only recommended when start URLs take a significant time to load (e.g., large sitemaps).
- Parameters:
asyncStartURLs - true if async
- Since:
- 3.0.0
-
setMaxDepth
public void setMaxDepth(int depth) -
getMaxDepth
public int getMaxDepth() -
getHttpFetchers
Gets HTTP fetchers.
- Returns:
HTTP fetchers (never null)
- Since:
- 3.0.0
-
setHttpFetchers
Sets HTTP fetchers.
- Parameters:
httpFetchers - list of HTTP fetchers
- Since:
- 3.0.0
-
setHttpFetchers
Sets HTTP fetchers.
- Parameters:
httpFetchers - list of HTTP fetchers
- Since:
- 3.0.0
-
getHttpFetchersMaxRetries
public int getHttpFetchersMaxRetries()
Gets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures. Default is zero (won't retry).
- Returns:
- number of times
- Since:
- 3.0.0
-
setHttpFetchersMaxRetries
public void setHttpFetchersMaxRetries(int httpFetchersMaxRetries)
Sets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures.
- Parameters:
httpFetchersMaxRetries - maximum number of retries
- Since:
- 3.0.0
-
getHttpFetchersRetryDelay
public long getHttpFetchersRetryDelay()
Gets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds). Default is zero (no delay).
- Returns:
- retry delay
- Since:
- 3.0.0
-
setHttpFetchersRetryDelay
public void setHttpFetchersRetryDelay(long httpFetchersRetryDelay)
Sets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds).
- Parameters:
httpFetchersRetryDelay - retry delay
- Since:
- 3.0.0
-
getCanonicalLinkDetector
Gets the canonical link detector.
- Returns:
the canonical link detector, or null if none are defined
- Since:
- 2.2.0
-
setCanonicalLinkDetector
Sets the canonical link detector. To disable canonical link detection, either pass a null argument, or invoke setIgnoreCanonicalLinks(boolean) with a true value.
- Parameters:
canonicalLinkDetector - the canonical link detector
- Since:
- 2.2.0
-
getLinkExtractors
Gets link extractors.
- Returns:
- link extractors
-
setLinkExtractors
Sets link extractors.
- Parameters:
linkExtractors - link extractors
-
setLinkExtractors
Sets link extractors.
- Parameters:
linkExtractors - link extractors
- Since:
- 3.0.0
-
getRobotsTxtProvider
-
setRobotsTxtProvider
-
getUrlNormalizer
Deprecated, for removal: This API element is subject to removal in a future version. Since 3.1.0, use getUrlNormalizers() instead.
- Returns:
- URL normalizer
-
setUrlNormalizer
@Deprecated(forRemoval=true, since="3.1.0") public void setUrlNormalizer(IURLNormalizer urlNormalizer)
Deprecated, for removal: This API element is subject to removal in a future version. Since 3.1.0, use setUrlNormalizers(List) instead.
- Parameters:
urlNormalizer - URL normalizer
-
getUrlNormalizers
Gets URL normalizers. Defaults to a single GenericURLNormalizer instance (with its default configuration).
- Returns:
URL normalizers or an empty list (never null)
- Since:
- 3.1.0
-
setUrlNormalizers
Sets URL normalizers.
- Parameters:
urlNormalizers - URL normalizers
- Since:
- 3.1.0
-
getDelayResolver
-
setDelayResolver
-
getPreImportProcessors
Gets pre-import processors.
- Returns:
- pre-import processors
-
setPreImportProcessors
Sets pre-import processors.
- Parameters:
preImportProcessors - pre-import processors
-
setPreImportProcessors
Sets pre-import processors.
- Parameters:
preImportProcessors - pre-import processors
- Since:
- 3.0.0
-
getPostImportProcessors
Gets post-import processors.
- Returns:
- post-import processors
-
setPostImportProcessors
Sets post-import processors.
- Parameters:
postImportProcessors - post-import processors
-
setPostImportProcessors
Sets post-import processors.
- Parameters:
postImportProcessors - post-import processors
- Since:
- 3.0.0
-
isIgnoreRobotsTxt
public boolean isIgnoreRobotsTxt() -
setIgnoreRobotsTxt
public void setIgnoreRobotsTxt(boolean ignoreRobotsTxt) -
isKeepDownloads
public boolean isKeepDownloads() -
setKeepDownloads
public void setKeepDownloads(boolean keepDownloads) -
isKeepOutOfScopeLinks
Deprecated. Since 3.0.0, use getKeepReferencedLinks(). Whether links not in scope should be stored as metadata under HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE.
- Returns:
true if keeping URLs not in scope
- Since:
- 2.8.0
-
setKeepOutOfScopeLinks
Deprecated. Since 3.0.0, use setKeepReferencedLinks(Set). Sets whether links not in scope should be stored as metadata under HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE.
- Parameters:
keepOutOfScopeLinks - true if keeping URLs not in scope
- Since:
- 2.8.0
-
getKeepReferencedLinks
Gets what type of referenced links to keep, if any. Those links are URLs extracted by link extractors. See class documentation for more details.
- Returns:
- preferences for keeping links
- Since:
- 3.0.0
-
setKeepReferencedLinks
Sets whether to keep referenced links and what to keep. Those links are URLs extracted by link extractors. See class documentation for more details.
- Parameters:
keepReferencedLinks - option for keeping links
- Since:
- 3.0.0
-
setKeepReferencedLinks
Sets whether to keep referenced links and what to keep. Those links are URLs extracted by link extractors. See class documentation for more details.
- Parameters:
keepReferencedLinks - option for keeping links
- Since:
- 3.0.0
-
isIgnoreRobotsMeta
public boolean isIgnoreRobotsMeta() -
setIgnoreRobotsMeta
public void setIgnoreRobotsMeta(boolean ignoreRobotsMeta) -
getRobotsMetaProvider
-
setRobotsMetaProvider
-
isIgnoreSitemap
public boolean isIgnoreSitemap()
Whether to ignore sitemap detection and resolving for URLs processed. Sitemaps specified as start URLs (getStartSitemapURLs()) are never ignored.
- Returns:
true to ignore sitemaps
-
setIgnoreSitemap
public void setIgnoreSitemap(boolean ignoreSitemap)
Sets whether to ignore sitemap detection and resolving for URLs processed. Sitemaps specified as start URLs (getStartSitemapURLs()) are never ignored.
- Parameters:
ignoreSitemap - true to ignore sitemaps
-
getSitemapResolver
-
setSitemapResolver
-
isIgnoreCanonicalLinks
public boolean isIgnoreCanonicalLinks()
Whether canonical links found in HTTP headers and in HTML files <head> section should be ignored or processed. When processed (default), pages with a canonical URL pointer in them are not processed.
- Returns:
true if ignoring canonical links
- Since:
- 2.2.0
-
setIgnoreCanonicalLinks
public void setIgnoreCanonicalLinks(boolean ignoreCanonicalLinks)
Sets whether canonical links found in HTTP headers and in HTML files <head> section should be ignored or processed. When processed (default), pages with a canonical URL pointer in them are not processed.
- Parameters:
ignoreCanonicalLinks - true if ignoring canonical links
- Since:
- 2.2.0
-
getURLCrawlScopeStrategy
Gets the strategy to use to determine if a URL is in scope.
- Returns:
- the strategy
-
setUrlCrawlScopeStrategy
Sets the strategy to use to determine if a URL is in scope.
- Parameters:
urlCrawlScopeStrategy - strategy to use
- Since:
- 2.8.1
-
getRecrawlableResolver
Gets the recrawlable resolver.
- Returns:
- recrawlable resolver
- Since:
- 2.5.0
-
setRecrawlableResolver
Sets the recrawlable resolver.
- Parameters:
recrawlableResolver - the recrawlable resolver
- Since:
- 2.5.0
-
getPostImportLinks
Gets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
- Returns:
- field matcher
- Since:
- 3.0.0
-
setPostImportLinks
Sets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
- Parameters:
fieldMatcher - field matcher
- Since:
- 3.0.0
-
isPostImportLinksKeep
public boolean isPostImportLinksKeep()
Gets whether to keep the importer-generated field holding URLs to consider for crawling.
- Returns:
true if keeping
- Since:
- 3.0.0
-
setPostImportLinksKeep
public void setPostImportLinksKeep(boolean postImportLinksKeep)
Sets whether to keep the importer-generated field holding URLs to consider for crawling.
- Parameters:
postImportLinksKeep - true if keeping
- Since:
- 3.0.0
-
saveCrawlerConfigToXML
- Specified by:
saveCrawlerConfigToXML in class CrawlerConfig
-
loadCrawlerConfigFromXML
- Specified by:
loadCrawlerConfigFromXML in class CrawlerConfig
-
equals
- Overrides:
equals in class CrawlerConfig
-
hashCode
public int hashCode()
- Overrides:
hashCode in class CrawlerConfig
-
toString
- Overrides:
toString in class CrawlerConfig
-