Class HttpCrawlerConfig
- java.lang.Object
  - com.norconex.collector.core.crawler.CrawlerConfig
    - com.norconex.collector.http.crawler.HttpCrawlerConfig
- All Implemented Interfaces:
IXMLConfigurable
public class HttpCrawlerConfig extends CrawlerConfig
HTTP Crawler configuration.
Start URLs
Crawling begins with one or more "start" URLs. Multiple start URLs can be defined, in a combination of ways:
- url: a start URL provided directly in the configuration (see setStartURLs(List)).
- urlsFile: a path to a local file containing a list of start URLs, one per line (see setStartURLsFiles(List)).
- sitemap: a URL pointing to a sitemap XML file that contains the URLs to crawl (see setStartSitemapURLs(List)).
- provider: your own class implementing IStartURLsProvider to dynamically provide a list of start URLs (see setStartURLsProviders(List)).
Scope: To limit crawling to specific web domains and avoid creating many filters to that effect, you can tell the crawler to "stay" within the web site "scope" with setUrlCrawlScopeStrategy(URLCrawlScopeStrategy), as sketched below.
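For illustration only, here is a minimal Java sketch defining start URLs and a crawl scope. The HttpCrawlerConfig setters are documented on this page; the package location of URLCrawlScopeStrategy and its setStayOnDomain(boolean) method are assumptions mirroring the stayOnDomain XML attribute shown further below.

import java.nio.file.Paths;

import com.norconex.collector.http.crawler.HttpCrawlerConfig;
// Assumed package for the scope strategy (not confirmed by this page):
import com.norconex.collector.http.crawler.URLCrawlScopeStrategy;

public class StartUrlsSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Start URLs given directly, from a seed file, and from a sitemap.
        cfg.setStartURLs("https://example.com/");
        cfg.setStartURLsFiles(Paths.get("/path/to/start-urls.txt"));
        cfg.setStartSitemapURLs("https://example.com/sitemap.xml");

        // Stay within the start URL domain (setStayOnDomain is assumed to
        // mirror the stayOnDomain XML attribute).
        URLCrawlScopeStrategy scope = new URLCrawlScopeStrategy();
        scope.setStayOnDomain(true);
        cfg.setUrlCrawlScopeStrategy(scope);
    }
}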
URL Normalization
Pages on web sites are often referenced using different URL patterns. Such URL variations can fool the crawler into downloading the same document multiple times. To avoid this, URLs are "normalized". That is, they are converted so they are always formulated the same way. By default, the crawler only applies normalization in ways that are semantically equivalent (see GenericURLNormalizer).
Crawl Speed
Be kind to the web sites you crawl. Being too aggressive can be perceived as a cyber-attack by the targeted web site (e.g., a DoS attack) and can lead to your crawler being blocked.
For this reason, the crawler plays nice by default: it waits a few seconds between each page download, regardless of the maximum number of threads specified or whether the pages crawled are on different web sites. This can of course be changed to be as fast as you want. See GenericDelayResolver for changing the default options. You can also provide your own "delay resolver" by supplying a class implementing IDelayResolver.
Crawl Depth
The crawl depth represents how many levels away from a start URL the crawler goes. From a browser user's perspective, it can be seen as the number of link "clicks" required from a start URL to get to a specific page. The crawler will keep going deeper for as long as it discovers new URLs that are not rejected by your configuration. This is not always desirable. For instance, a web site could have dynamically generated URLs with infinite possibilities (e.g., dynamically generated web calendars). To avoid infinite crawls, it is recommended to limit the maximum depth to something reasonable for your site with setMaxDepth(int).
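A rough sketch tuning crawl speed and depth. setDelayResolver(IDelayResolver) and setMaxDepth(int) are documented here; the GenericDelayResolver package and its setDefaultDelay(long) setter (milliseconds) are assumptions about the default implementation.

import com.norconex.collector.http.crawler.HttpCrawlerConfig;
// Assumed package and setter for the default delay resolver:
import com.norconex.collector.http.delay.impl.GenericDelayResolver;

public class SpeedAndDepthSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Wait two seconds between page downloads (assumed to be the
        // meaning of setDefaultDelay, in milliseconds).
        GenericDelayResolver delay = new GenericDelayResolver();
        delay.setDefaultDelay(2000);
        cfg.setDelayResolver(delay);

        // Do not follow links more than 5 "clicks" away from a start URL.
        cfg.setMaxDepth(5);
    }
}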
Keeping downloaded files
Downloaded files are deleted after being processed. Set setKeepDownloads(boolean) to true in order to preserve them. Files will be kept under a new "downloads" folder found under your working directory. Keep in mind this is not a method for cloning a site. Use with caution on large sites as it can quickly fill up the local disk space.
Keeping Referenced Links
By default the crawler stores, as metadata, URLs extracted from documents that are in scope. Exceptions are pages discovered at the configured maximum depth (setMaxDepth(int)). This can be changed using the setKeepReferencedLinks(Set) method, as shown in the sketch after this list. Changing this setting has no effect on which pages get crawled. Possible options are:
- INSCOPE: Default. Store "in-scope" links as HttpDocMetadata.REFERENCED_URLS.
- OUTSCOPE: Store "out-of-scope" links as HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE.
- MAXDEPTH: Also store links extracted from pages at maximum depth. Must be used with at least one other option to have any effect.
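A minimal sketch combining this option with the previous one; only methods and enum constants documented on this page are used.

import com.norconex.collector.http.crawler.HttpCrawlerConfig;
import com.norconex.collector.http.crawler.HttpCrawlerConfig.ReferencedLinkType;

public class KeepLinksSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Keep a copy of downloaded files under the "downloads" folder
        // of the working directory (off by default).
        cfg.setKeepDownloads(true);

        // Store both in-scope and out-of-scope extracted URLs as metadata.
        cfg.setKeepReferencedLinks(
                ReferencedLinkType.INSCOPE,
                ReferencedLinkType.OUTSCOPE);
    }
}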
Orphan documents
Orphans are valid documents which, on subsequent crawls, can no longer be reached (e.g., they are no longer referenced). This is regardless of whether the file has been deleted at the source or not. You can tell the crawler how to handle them with CrawlerConfig.setOrphansStrategy(OrphansStrategy), as sketched after the list below. Possible options are:
- PROCESS: Default. Tries to crawl orphans normally, as if they were still reachable by the crawler.
- IGNORE: Does nothing with orphans (not deleted, not processed).
- DELETE: Orphans are sent to your Committer for deletion.
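For instance, a minimal sketch asking the crawler to delete orphans from the configured Committer(s):

import com.norconex.collector.core.crawler.CrawlerConfig.OrphansStrategy;
import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public class OrphanStrategySketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Documents that can no longer be reached on a subsequent crawl
        // are sent to the Committer(s) as deletion requests.
        cfg.setOrphansStrategy(OrphansStrategy.DELETE);
    }
}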
Error Handling
By default the crawler logs exceptions and tries to prevent them from terminating a crawling session. There might be cases where you want the crawler to halt upon encountering some types of exceptions. You can do so with CrawlerConfig.setStopOnExceptions(List).
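As an illustration only (the element type of the list is not documented on this page; the sketch assumes it holds exception classes, matching how stop-on-exception class names are configured in XML):

import java.util.Arrays;
import java.util.List;

import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public class StopOnExceptionSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Assumption: the list holds exception classes; any such exception
        // (or a subclass of it) would halt the crawling session.
        List<Class<? extends Exception>> stoppers = Arrays.asList(
                java.net.UnknownHostException.class);
        cfg.setStopOnExceptions(stoppers);
    }
}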
Crawler Events
The crawler fires all kinds of events to notify interested parties of such things as a document being rejected, imported, committed, etc. You can listen to crawler events using CrawlerConfig.setEventListeners(List).
Data Store (Cache)
During and between crawl sessions, the crawler needs to preserve specific information in order to keep track of things such as the queue of document references to process, those already processed, whether a document has been modified since last crawled, cached document checksums, etc. For this, the crawler uses a database we call a crawl data store engine. The default implementation uses the local file system to store these (see MVStoreDataStoreEngine). While very capable and suitable for most sites, if you need a larger storage system you can provide your own implementation with CrawlerConfig.setDataStoreEngine(IDataStoreEngine).
Document Importing
The process of transforming, enhancing, and parsing documents to extract plain text, along with many other document-specific processing activities, is handled by the Norconex Importer module. See ImporterConfig for many additional configuration options.
Bad Documents
On a fresh crawl, documents that are unreachable or not obtained successfully for some reason are simply logged and ignored. On the other hand, documents that were successfully crawled once and suddenly fail on a subsequent crawl are considered "spoiled". You can decide whether to grace (retry next time), delete, or ignore those spoiled documents with CrawlerConfig.setSpoiledReferenceStrategizer(ISpoiledReferenceStrategizer).
Committing Documents
The last step of successfully processing a document is to store it in your preferred target repository (or repositories). For this to happen, you have to configure one or more Committers corresponding to your needs, or create a custom one. You can have a look at available Committers here: https://opensource.norconex.com/committers/ (see CrawlerConfig.setCommitters(List)).
HTTP Fetcher
To crawl and parse a document, it needs to be downloaded first. This is the role of one or more HTTP fetchers. GenericHttpFetcher is the default implementation and can handle most web sites. There might be cases where a more specialized way of obtaining web resources is needed. For instance, JavaScript-generated web pages are often best handled by web browsers; in such cases you can use the WebDriverHttpFetcher. You can also use setHttpFetchers(List) to supply your own fetcher implementation.
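A minimal sketch of the retry-related fetcher settings (the default fetcher is left in place; only methods documented on this page are used):

import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public class FetcherRetrySketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Retry a failed fetch attempt up to 2 more times...
        cfg.setHttpFetchersMaxRetries(2);

        // ...waiting 5000 milliseconds between attempts.
        cfg.setHttpFetchersRetryDelay(5000);
    }
}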
HTTP Methods
A fetcher typically issues an HTTP GET request to obtain a document. There might be cases where you first want to issue a separate HEAD request. One example is to filter documents based on the HTTP HEAD response information, thus possibly avoiding the download of large files you don't want.
You can tell the crawler how it should handle HTTP GET and HEAD requests using setFetchHttpGet(HttpMethodSupport) and setFetchHttpHead(HttpMethodSupport) respectively. For each, the options are:
- DISABLED: No HTTP call will be made using that method.
- OPTIONAL: If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document can still be processed successfully by the other HTTP method. Only relevant when both HEAD and GET are enabled.
- REQUIRED: If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document will be rejected and won't go any further, even if the other HTTP method was or could have been successful. Only relevant when both HEAD and GET are enabled.
If you enable only one HTTP method (the default), then specifying OPTIONAL or REQUIRED for it has the same effect. At least one method needs to be enabled for an HTTP request to be attempted. By default, HEAD requests are DISABLED and GET is REQUIRED. If you are unsure what settings to use, keep the defaults.
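A minimal sketch enabling a filtering-friendly HEAD request ahead of the usual GET, using only values documented above:

import com.norconex.collector.http.crawler.HttpCrawlerConfig;
import com.norconex.collector.http.crawler.HttpCrawlerConfig.HttpMethodSupport;

public class HttpMethodSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Try a HEAD request first so metadata filters can reject documents
        // before the (possibly large) GET download, without failing the
        // document when HEAD is unsupported or unsuccessful.
        cfg.setFetchHttpHead(HttpMethodSupport.OPTIONAL);

        // Keep the default: a successful GET is required to process a document.
        cfg.setFetchHttpGet(HttpMethodSupport.REQUIRED);
    }
}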
Filtering Unwanted Documents
Without filtering, you would typically crawl many documents you are not interested in. There are different types of filtering offered to you, occurring at different points during the URL processing life-cycle. The sooner in that life-cycle you filter out a document, the more you can improve crawler performance. It may be important for you to understand the differences:
- Reference filters: The fastest way to exclude a document. The filtering rule applies to the URL, before any HTTP request is made for that URL. Rejected documents are not queued for processing. They are not downloaded (thus no URLs are extracted). The specified "delay" between downloads is not applied (i.e., no delay for rejected documents).
- Metadata filters: Apply filtering on a document's metadata fields. If isFetchHttpHead() returns true, these filters are invoked after the crawler performs a distinct HTTP HEAD request, giving you the opportunity to filter documents based on the HTTP HEAD response and potentially save a more expensive HTTP GET download (at the cost of two HTTP requests for valid documents -- HEAD and GET). Filtering occurs before URLs are extracted. When isFetchHttpHead() is false, these filters are invoked on the metadata of the HTTP response obtained from an HTTP GET request (as the document is downloaded). Filtering occurs after URLs are extracted.
- Document filters: Use when having access to the document itself (and its content) is required to apply filtering. Always triggered after a document is downloaded and after URLs are extracted, but before it is imported (Importer module).
- Importer filters: The Importer module also offers document filtering options. At that point a document is already downloaded and its links extracted. There are two types of filtering offered by the Importer: before and after document parsing. Use filters before parsing if you need to filter on raw content or want to prevent more expensive parsing. Use filters after parsing when you need to read the content as plain text.
Robot Directives
By default, the crawler tries to respect instructions a web site has put in place for the benefit of crawlers. Here is a list of some of the popular ones that can be turned off or for which you can supply your own implementation (a short sketch follows the list):
- Robot rules: Rules defined in a "robots.txt" file at the root of a web site, or via X-Robots-Tag. See: setIgnoreRobotsTxt(boolean), setRobotsTxtProvider(IRobotsTxtProvider), setIgnoreRobotsMeta(boolean), setRobotsMetaProvider(IRobotsMetaProvider).
- HTML "nofollow": Most HTML-oriented link extractors support the rel="nofollow" attribute set on HTML links. See: HtmlLinkExtractor.setIgnoreNofollow(boolean).
- Sitemap: Sitemap XML files are auto-detected and used to find a list of URLs to crawl. To disable detection, use setIgnoreSitemap(boolean).
- Canonical URLs: The crawler will reject URLs that are non-canonical, as per HTML <meta ...> or HTTP response instructions. To crawl non-canonical pages, use setIgnoreCanonicalLinks(boolean).
- If Modified Since: The default HTTP Fetcher (GenericHttpFetcher) uses the If-Modified-Since feature as part of its HTTP requests for web sites supporting it (only affects incremental crawls). To turn that off, use GenericHttpFetcherConfig.setDisableIfModifiedSince(boolean).
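A minimal sketch turning several of these directives off (use sparingly, since respecting them is usually the right thing to do):

import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public class RobotDirectivesSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Ignore robots.txt rules and robots meta / X-Robots-Tag directives.
        cfg.setIgnoreRobotsTxt(true);
        cfg.setIgnoreRobotsMeta(true);

        // Do not auto-detect sitemap.xml files for processed URLs.
        cfg.setIgnoreSitemap(true);

        // Crawl pages even when they point to a different canonical URL.
        cfg.setIgnoreCanonicalLinks(true);
    }
}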
Re-crawl Frequency
The crawler will crawl any given URL at most one time per crawling session. It is possible to skip documents that are not yet "ready" to be re-crawled, to speed up each crawling session. Sitemap.xml directives to that effect are respected by default ("frequency" and "lastmod"). You can define your own re-crawl conditions with setRecrawlableResolver(IRecrawlableResolver). This feature can be used, for instance, to crawl the "news" section of your site more frequently than, let's say, its "archive" section.
Change Detection (Checksums)
To find out if a document has changed from one crawling session to another, the crawler creates and keeps a digital signature, or checksum, of each crawled document. Upon crawling the same URL again, a new checksum is created and compared against the previous one. Any difference indicates a modified document. There are two checksums at play, tested at different times: one obtained from a document's metadata (default is LastModifiedMetadataChecksummer) and one from the document itself (default is MD5DocumentChecksummer). You can provide your own implementations. See CrawlerConfig.setMetadataChecksummer(IMetadataChecksummer) and CrawlerConfig.setDocumentChecksummer(IDocumentChecksummer).
Deduplication
EXPERIMENTAL: The crawler can attempt to detect and reject documents considered duplicates within a crawler session. A document is considered a duplicate if a document with the same metadata or document checksum has already been processed. To enable this feature, set CrawlerConfig.setMetadataDeduplicate(boolean) and/or CrawlerConfig.setDocumentDeduplicate(boolean) to true. Setting those will have no effect if the corresponding checksummers are not set (null).
Deduplication can impact crawl performance. It is recommended you use it only if you can't distinguish duplicates via other means (URL normalizer, canonical URL support, etc.). Also, you should only enable this feature if you know your checksummer(s) will generate a checksum that is acceptably unique to you.
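A minimal sketch enabling checksum-based deduplication. The CrawlerConfig setters are documented on this page; the package locations of the two default checksummer classes and their no-argument constructors are assumptions.

// Assumed package locations for the default checksummers:
import com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer;
import com.norconex.collector.http.checksum.impl.LastModifiedMetadataChecksummer;
import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public class DeduplicationSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // Deduplication has no effect unless the matching checksummers are set.
        cfg.setMetadataChecksummer(new LastModifiedMetadataChecksummer());
        cfg.setDocumentChecksummer(new MD5DocumentChecksummer());

        // EXPERIMENTAL: reject documents whose metadata or document checksum
        // was already seen in the same crawler session.
        cfg.setMetadataDeduplicate(true);
        cfg.setDocumentDeduplicate(true);
    }
}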
URL Extraction
To be able to crawl a web site, links need to be extracted from web pages. That is the job of a link extractor. It is possible to use multiple link extractors for different types of content. By default, the HtmlLinkExtractor is used, but you can add others or provide your own with setLinkExtractors(List).
There might be cases where you want a document to be parsed by the Importer and to establish yourself which links to process during the importing phase (for more advanced use cases). In such cases, you can identify a document metadata field to use as a URL holding tank after importing has occurred. URLs in that field will become eligible for crawling. See setPostImportLinks(TextMatcher).
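As an illustration, a sketch that queues URLs found in a hypothetical "myPageLinks" metadata field after importing. The field name is purely illustrative, and the TextMatcher package and its basic(String) factory are assumptions (it comes from Norconex Commons Lang, not from this module).

import com.norconex.collector.http.crawler.HttpCrawlerConfig;
// Assumed package and factory method for TextMatcher (Norconex Commons Lang):
import com.norconex.commons.lang.text.TextMatcher;

public class PostImportLinksSketch {
    public static void main(String[] args) {
        HttpCrawlerConfig cfg = new HttpCrawlerConfig();

        // "myPageLinks" is a hypothetical field your Importer configuration
        // would populate with URLs to consider for crawling.
        cfg.setPostImportLinks(TextMatcher.basic("myPageLinks"));

        // Keep that field on the document instead of deleting it once
        // its URLs have been queued.
        cfg.setPostImportLinksKeep(true);
    }
}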
XML configuration usage:
<crawler id="(crawler unique identifier)">
  <startURLs
      stayOnDomain="[false|true]"
      includeSubdomains="[false|true]"
      stayOnPort="[false|true]"
      stayOnProtocol="[false|true]"
      async="[false|true]">
    <!-- All the following tags are repeatable. -->
    <url>(a URL)</url>
    <urlsFile>(local path to a file containing URLs)</urlsFile>
    <sitemap>(URL to a sitemap XML)</sitemap>
    <provider class="(IStartURLsProvider implementation)"/>
  </startURLs>
  <urlNormalizers>
    <urlNormalizer class="(IURLNormalizer implementation)"/>
  </urlNormalizers>
  <delay class="(IDelayResolver implementation)"/>
  <maxDepth>(maximum crawl depth)</maxDepth>
  <keepDownloads>[false|true]</keepDownloads>
  <keepReferencedLinks>[INSCOPE|OUTSCOPE|MAXDEPTH]</keepReferencedLinks>
  <fetchHttpHead>[DISABLED|REQUIRED|OPTIONAL]</fetchHttpHead>
  <fetchHttpGet>[REQUIRED|DISABLED|OPTIONAL]</fetchHttpGet>
  <httpFetchers
      maxRetries="(number of times to retry a failed fetch attempt)"
      retryDelay="(how many milliseconds to wait between re-attempting)">
    <!-- Repeatable -->
    <fetcher class="(IHttpFetcher implementation)"/>
  </httpFetchers>
  <robotsTxt ignore="[false|true]" class="(IRobotsTxtProvider implementation)"/>
  <sitemapResolver ignore="[false|true]" class="(ISitemapResolver implementation)"/>
  <recrawlableResolver class="(IRecrawlableResolver implementation)"/>
  <canonicalLinkDetector ignore="[false|true]" class="(ICanonicalLinkDetector implementation)"/>
  <robotsMeta ignore="[false|true]" class="(IRobotsMetaProvider implementation)"/>
  <linkExtractors>
    <!-- Repeatable -->
    <extractor class="(ILinkExtractor implementation)"/>
  </linkExtractors>
  <preImportProcessors>
    <!-- Repeatable -->
    <processor class="(IHttpDocumentProcessor implementation)"/>
  </preImportProcessors>
  <postImportProcessors>
    <!-- Repeatable -->
    <processor class="(IHttpDocumentProcessor implementation)"/>
  </postImportProcessors>
  <postImportLinks keep="[false|true]">
    <fieldMatcher/>
  </postImportLinks>
</crawler>
- Author:
- Pascal Essiembre
-
Nested Class Summary
- static class HttpCrawlerConfig.HttpMethodSupport
- static class HttpCrawlerConfig.ReferencedLinkType
-
Nested classes/interfaces inherited from class com.norconex.collector.core.crawler.CrawlerConfig
CrawlerConfig.OrphansStrategy
-
Constructor Summary
- HttpCrawlerConfig()
-
Method Summary
- boolean equals(Object other)
- ICanonicalLinkDetector getCanonicalLinkDetector(): Gets the canonical link detector.
- IDelayResolver getDelayResolver()
- HttpCrawlerConfig.HttpMethodSupport getFetchHttpGet(): Gets whether to fetch HTTP documents using an HTTP GET request.
- HttpCrawlerConfig.HttpMethodSupport getFetchHttpHead(): Gets whether to fetch HTTP response headers using an HTTP HEAD request.
- List<IHttpFetcher> getHttpFetchers(): Gets HTTP fetchers.
- int getHttpFetchersMaxRetries(): Gets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures.
- long getHttpFetchersRetryDelay(): Gets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds).
- Set<HttpCrawlerConfig.ReferencedLinkType> getKeepReferencedLinks(): Gets what type of referenced links to keep, if any.
- List<ILinkExtractor> getLinkExtractors(): Gets link extractors.
- int getMaxDepth()
- TextMatcher getPostImportLinks(): Gets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
- List<IHttpDocumentProcessor> getPostImportProcessors(): Gets post-import processors.
- List<IHttpDocumentProcessor> getPreImportProcessors(): Gets pre-import processors.
- IRecrawlableResolver getRecrawlableResolver(): Gets the recrawlable resolver.
- IRobotsMetaProvider getRobotsMetaProvider()
- IRobotsTxtProvider getRobotsTxtProvider()
- ISitemapResolver getSitemapResolver()
- List<String> getStartSitemapURLs(): Gets sitemap URLs to be used as starting points for crawling.
- List<String> getStartURLs(): Gets URLs to initiate crawling from.
- List<Path> getStartURLsFiles(): Gets the file paths of seed files containing URLs to be used as "start URLs".
- List<IStartURLsProvider> getStartURLsProviders(): Gets the providers of URLs used as starting points for crawling.
- URLCrawlScopeStrategy getURLCrawlScopeStrategy(): Gets the strategy to use to determine if a URL is in scope.
- IURLNormalizer getUrlNormalizer(): Deprecated, for removal: this API element is subject to removal in a future version. Since 3.1.0, use getUrlNormalizers() instead.
- List<IURLNormalizer> getUrlNormalizers(): Gets URL normalizers.
- int hashCode()
- boolean isFetchHttpHead(): Deprecated. Use getFetchHttpHead().
- boolean isIgnoreCanonicalLinks(): Whether canonical links found in HTTP headers and in the <head> section of HTML files should be ignored or processed.
- boolean isIgnoreRobotsMeta()
- boolean isIgnoreRobotsTxt()
- boolean isIgnoreSitemap(): Whether to ignore sitemap detection and resolving for URLs processed.
- boolean isKeepDownloads()
- boolean isKeepOutOfScopeLinks(): Deprecated. Since 3.0.0, use getKeepReferencedLinks().
- boolean isPostImportLinksKeep(): Gets whether to keep the importer-generated field holding URLs to consider for crawling.
- boolean isStartURLsAsync(): Gets whether the start URLs should be loaded asynchronously.
- protected void loadCrawlerConfigFromXML(XML xml)
- protected void saveCrawlerConfigToXML(XML xml)
- void setCanonicalLinkDetector(ICanonicalLinkDetector canonicalLinkDetector): Sets the canonical link detector.
- void setDelayResolver(IDelayResolver delayResolver)
- void setFetchHttpGet(HttpCrawlerConfig.HttpMethodSupport fetchHttpGet): Sets whether to fetch HTTP documents using an HTTP GET request.
- void setFetchHttpHead(boolean fetchHttpHead): Deprecated.
- void setFetchHttpHead(HttpCrawlerConfig.HttpMethodSupport fetchHttpHead): Sets whether to fetch HTTP response headers using an HTTP HEAD request.
- void setHttpFetchers(IHttpFetcher... httpFetchers): Sets HTTP fetchers.
- void setHttpFetchers(List<IHttpFetcher> httpFetchers): Sets HTTP fetchers.
- void setHttpFetchersMaxRetries(int httpFetchersMaxRetries): Sets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures.
- void setHttpFetchersRetryDelay(long httpFetchersRetryDelay): Sets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds).
- void setIgnoreCanonicalLinks(boolean ignoreCanonicalLinks): Sets whether canonical links found in HTTP headers and in the <head> section of HTML files should be ignored or processed.
- void setIgnoreRobotsMeta(boolean ignoreRobotsMeta)
- void setIgnoreRobotsTxt(boolean ignoreRobotsTxt)
- void setIgnoreSitemap(boolean ignoreSitemap): Sets whether to ignore sitemap detection and resolving for URLs processed.
- void setKeepDownloads(boolean keepDownloads)
- void setKeepOutOfScopeLinks(boolean keepOutOfScopeLinks): Deprecated. Since 3.0.0, use setKeepReferencedLinks(Set).
- void setKeepReferencedLinks(HttpCrawlerConfig.ReferencedLinkType... keepReferencedLinks): Sets whether to keep referenced links and what to keep.
- void setKeepReferencedLinks(Set<HttpCrawlerConfig.ReferencedLinkType> keepReferencedLinks): Sets whether to keep referenced links and what to keep.
- void setLinkExtractors(ILinkExtractor... linkExtractors): Sets link extractors.
- void setLinkExtractors(List<ILinkExtractor> linkExtractors): Sets link extractors.
- void setMaxDepth(int depth)
- void setPostImportLinks(TextMatcher fieldMatcher): Sets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.
- void setPostImportLinksKeep(boolean postImportLinksKeep): Sets whether to keep the importer-generated field holding URLs to consider for crawling.
- void setPostImportProcessors(IHttpDocumentProcessor... postImportProcessors): Sets post-import processors.
- void setPostImportProcessors(List<IHttpDocumentProcessor> postImportProcessors): Sets post-import processors.
- void setPreImportProcessors(IHttpDocumentProcessor... preImportProcessors): Sets pre-import processors.
- void setPreImportProcessors(List<IHttpDocumentProcessor> preImportProcessors): Sets pre-import processors.
- void setRecrawlableResolver(IRecrawlableResolver recrawlableResolver): Sets the recrawlable resolver.
- void setRobotsMetaProvider(IRobotsMetaProvider robotsMetaProvider)
- void setRobotsTxtProvider(IRobotsTxtProvider robotsTxtProvider)
- void setSitemapResolver(ISitemapResolver sitemapResolver)
- void setStartSitemapURLs(String... startSitemapURLs): Sets the sitemap URLs used as starting points for crawling.
- void setStartSitemapURLs(List<String> startSitemapURLs): Sets the sitemap URLs used as starting points for crawling.
- void setStartURLs(String... startURLs): Sets URLs to initiate crawling from.
- void setStartURLs(List<String> startURLs): Sets URLs to initiate crawling from.
- void setStartURLsAsync(boolean asyncStartURLs): Sets whether the start URLs should be loaded asynchronously.
- void setStartURLsFiles(Path... startURLsFiles): Sets the file paths of seed files containing URLs to be used as "start URLs".
- void setStartURLsFiles(List<Path> startURLsFiles): Sets the file paths of seed files containing URLs to be used as "start URLs".
- void setStartURLsProviders(IStartURLsProvider... startURLsProviders): Sets the providers of URLs used as starting points for crawling.
- void setStartURLsProviders(List<IStartURLsProvider> startURLsProviders): Sets the providers of URLs used as starting points for crawling.
- void setUrlCrawlScopeStrategy(URLCrawlScopeStrategy urlCrawlScopeStrategy): Sets the strategy to use to determine if a URL is in scope.
- void setUrlNormalizer(IURLNormalizer urlNormalizer): Deprecated, for removal: this API element is subject to removal in a future version. Since 3.1.0, use setUrlNormalizers(List) instead.
- void setUrlNormalizers(List<IURLNormalizer> urlNormalizers): Sets URL normalizers.
- String toString()
-
Methods inherited from class com.norconex.collector.core.crawler.CrawlerConfig
addEventListeners, addEventListeners, clearEventListeners, getCommitter, getCommitters, getDataStoreEngine, getDocumentChecksummer, getDocumentFilters, getEventListeners, getId, getImporterConfig, getMaxDocuments, getMetadataChecksummer, getMetadataFilters, getNumThreads, getOrphansStrategy, getReferenceFilters, getSpoiledReferenceStrategizer, getStopOnExceptions, isDocumentDeduplicate, isMetadataDeduplicate, loadFromXML, saveToXML, setCommitter, setCommitters, setCommitters, setDataStoreEngine, setDocumentChecksummer, setDocumentDeduplicate, setDocumentFilters, setDocumentFilters, setEventListeners, setEventListeners, setId, setImporterConfig, setMaxDocuments, setMetadataChecksummer, setMetadataDeduplicate, setMetadataFilters, setMetadataFilters, setNumThreads, setOrphansStrategy, setReferenceFilters, setReferenceFilters, setSpoiledReferenceStrategizer, setStopOnExceptions, setStopOnExceptions
-
Method Detail
-
isFetchHttpHead
@Deprecated public boolean isFetchHttpHead()
Deprecated. Use getFetchHttpHead().
- Returns:
- true if fetching HTTP response headers separately
- Since:
- 3.0.0-M1
-
setFetchHttpHead
@Deprecated public void setFetchHttpHead(boolean fetchHttpHead)
Deprecated.
- Parameters:
- fetchHttpHead - true if fetching HTTP response headers separately
- Since:
- 3.0.0-M1
-
getFetchHttpHead
public HttpCrawlerConfig.HttpMethodSupport getFetchHttpHead()
Gets whether to fetch HTTP response headers using an HTTP HEAD request. That HTTP request is performed separately from a document download request (HTTP "GET"). Useful when you need to filter documents based on HTTP header values, without downloading them first (e.g., to save bandwidth). When dealing with small documents on average, it may be best to avoid issuing two requests when a single one could do it.
HttpCrawlerConfig.HttpMethodSupport.DISABLED by default. See class documentation for more details.
- Returns:
- HTTP HEAD method support
- Since:
- 3.0.0
-
setFetchHttpHead
public void setFetchHttpHead(HttpCrawlerConfig.HttpMethodSupport fetchHttpHead)
Sets whether to fetch HTTP response headers using an HTTP HEAD request.
See class documentation for more details.
- Parameters:
- fetchHttpHead - HTTP HEAD method support
- Since:
- 3.0.0
-
getFetchHttpGet
public HttpCrawlerConfig.HttpMethodSupport getFetchHttpGet()
Gets whether to fetch HTTP documents using an HTTP GET request. Requests made using the HTTP GET method are usually required to download a document and have its content extracted and links discovered. It should never be disabled unless you have an exceptional use case.
HttpCrawlerConfig.HttpMethodSupport.REQUIRED by default. See class documentation for more details.
- Returns:
- HTTP GET method support
- Since:
- 3.0.0
-
setFetchHttpGet
public void setFetchHttpGet(HttpCrawlerConfig.HttpMethodSupport fetchHttpGet)
Sets whether to fetch HTTP documents using an HTTP GET request. Requests made using the HTTP GET method are usually required to download a document and have its content extracted and links discovered. It should never be disabled unless you have an exceptional use case.
See class documentation for more details.
- Parameters:
- fetchHttpGet - HTTP GET method support
- Since:
- 3.0.0
-
getStartURLs
public List<String> getStartURLs()
Gets URLs to initiate crawling from.- Returns:
- start URLs (never
null
)
-
setStartURLs
public void setStartURLs(String... startURLs)
Sets URLs to initiate crawling from.- Parameters:
startURLs
- start URLs
-
setStartURLs
public void setStartURLs(List<String> startURLs)
Sets URLs to initiate crawling from.- Parameters:
startURLs
- start URLs- Since:
- 3.0.0
-
getStartURLsFiles
public List<Path> getStartURLsFiles()
Gets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.- Returns:
- file paths of seed files containing URLs
(never
null
) - Since:
- 2.3.0
-
setStartURLsFiles
public void setStartURLsFiles(Path... startURLsFiles)
Sets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.- Parameters:
startURLsFiles
- file paths of seed files containing URLs- Since:
- 2.3.0
-
setStartURLsFiles
public void setStartURLsFiles(List<Path> startURLsFiles)
Sets the file paths of seed files containing URLs to be used as "start URLs". Files are expected to have one URL per line. Blank lines and lines starting with # (comment) are ignored.- Parameters:
startURLsFiles
- file paths of seed files containing URLs- Since:
- 3.0.0
-
getStartSitemapURLs
public List<String> getStartSitemapURLs()
Gets sitemap URLs to be used as starting points for crawling.- Returns:
- sitemap URLs (never
null
) - Since:
- 2.3.0
-
setStartSitemapURLs
public void setStartSitemapURLs(String... startSitemapURLs)
Sets the sitemap URLs used as starting points for crawling.- Parameters:
startSitemapURLs
- sitemap URLs- Since:
- 2.3.0
-
setStartSitemapURLs
public void setStartSitemapURLs(List<String> startSitemapURLs)
Sets the sitemap URLs used as starting points for crawling.- Parameters:
startSitemapURLs
- sitemap URLs- Since:
- 3.0.0
-
getStartURLsProviders
public List<IStartURLsProvider> getStartURLsProviders()
Gets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
- Returns:
- start URL providers (never
null
) - Since:
- 2.7.0
-
setStartURLsProviders
public void setStartURLsProviders(IStartURLsProvider... startURLsProviders)
Sets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
- Parameters:
- startURLsProviders - start URL providers
- Since:
- 2.7.0
-
setStartURLsProviders
public void setStartURLsProviders(List<IStartURLsProvider> startURLsProviders)
Sets the providers of URLs used as starting points for crawling. Use this approach over other methods when URLs need to be provided dynamically at launch time. URLs obtained by a provider are combined with start URLs provided through other methods.
- Parameters:
- startURLsProviders - start URL providers
- Since:
- 3.0.0
-
isStartURLsAsync
public boolean isStartURLsAsync()
Gets whether the start URLs should be loaded asynchronously. When true, the crawler will start processing URLs in the queue even if start URLs are still being loaded. While this may speed up crawling, it may have an unexpected effect on the accuracy of HttpDocMetadata.DEPTH. Use of this option is only recommended when start URLs take a significant time to load (e.g., large sitemaps).
- Returns:
- true if async
- Since:
- 3.0.0
-
setStartURLsAsync
public void setStartURLsAsync(boolean asyncStartURLs)
Sets whether the start URLs should be loaded asynchronously. When true, the crawler will start processing URLs in the queue even if start URLs are still being loaded. While this may speed up crawling, it may have an unexpected effect on the accuracy of HttpDocMetadata.DEPTH. Use of this option is only recommended when start URLs take a significant time to load (e.g., large sitemaps).
- Parameters:
- asyncStartURLs - true if async
- Since:
- 3.0.0
-
setMaxDepth
public void setMaxDepth(int depth)
-
getMaxDepth
public int getMaxDepth()
-
getHttpFetchers
public List<IHttpFetcher> getHttpFetchers()
Gets HTTP fetchers.
- Returns:
- HTTP fetchers (never null)
- Since:
- 3.0.0
-
setHttpFetchers
public void setHttpFetchers(IHttpFetcher... httpFetchers)
Sets HTTP fetchers.- Parameters:
httpFetchers
- list of HTTP fetchers- Since:
- 3.0.0
-
setHttpFetchers
public void setHttpFetchers(List<IHttpFetcher> httpFetchers)
Sets HTTP fetchers.- Parameters:
httpFetchers
- list of HTTP fetchers- Since:
- 3.0.0
-
getHttpFetchersMaxRetries
public int getHttpFetchersMaxRetries()
Gets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures. Default is zero (won't retry).- Returns:
- number of times
- Since:
- 3.0.0
-
setHttpFetchersMaxRetries
public void setHttpFetchersMaxRetries(int httpFetchersMaxRetries)
Sets the maximum number of times an HTTP fetcher will re-attempt fetching a resource in case of failures.- Parameters:
httpFetchersMaxRetries
- maximum number of retries- Since:
- 3.0.0
-
getHttpFetchersRetryDelay
public long getHttpFetchersRetryDelay()
Gets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds). Default is zero (no delay).- Returns:
- retry delay
- Since:
- 3.0.0
-
setHttpFetchersRetryDelay
public void setHttpFetchersRetryDelay(long httpFetchersRetryDelay)
Sets how long to wait before a failing HTTP fetcher re-attempts fetching a resource in case of failures (in milliseconds).- Parameters:
httpFetchersRetryDelay
- retry delay- Since:
- 3.0.0
-
getCanonicalLinkDetector
public ICanonicalLinkDetector getCanonicalLinkDetector()
Gets the canonical link detector.- Returns:
- the canonical link detector, or
null
if none are defined. - Since:
- 2.2.0
-
setCanonicalLinkDetector
public void setCanonicalLinkDetector(ICanonicalLinkDetector canonicalLinkDetector)
Sets the canonical link detector. To disable canonical link detection, either pass a null argument, or invoke setIgnoreCanonicalLinks(boolean) with a true value.
- Parameters:
- canonicalLinkDetector - the canonical link detector
- Since:
- 2.2.0
-
getLinkExtractors
public List<ILinkExtractor> getLinkExtractors()
Gets link extractors.- Returns:
- link extractors
-
setLinkExtractors
public void setLinkExtractors(ILinkExtractor... linkExtractors)
Sets link extractors.- Parameters:
linkExtractors
- link extractors
-
setLinkExtractors
public void setLinkExtractors(List<ILinkExtractor> linkExtractors)
Sets link extractors.- Parameters:
linkExtractors
- link extractors- Since:
- 3.0.0
-
getRobotsTxtProvider
public IRobotsTxtProvider getRobotsTxtProvider()
-
setRobotsTxtProvider
public void setRobotsTxtProvider(IRobotsTxtProvider robotsTxtProvider)
-
getUrlNormalizer
@Deprecated(forRemoval=true, since="3.1.0") public IURLNormalizer getUrlNormalizer()
Deprecated, for removal: This API element is subject to removal in a future version. Since 3.1.0, use getUrlNormalizers() instead.
- Returns:
- URL normalizer
-
setUrlNormalizer
@Deprecated(forRemoval=true, since="3.1.0") public void setUrlNormalizer(IURLNormalizer urlNormalizer)
Deprecated, for removal: This API element is subject to removal in a future version. Since 3.1.0, use setUrlNormalizers(List) instead.
- Parameters:
urlNormalizer
- URL normalizer
-
getUrlNormalizers
public List<IURLNormalizer> getUrlNormalizers()
Gets URL normalizers. Defaults to a single GenericURLNormalizer instance (with its default configuration).
- Returns:
- URL normalizers, or an empty list (never null)
- Since:
- 3.1.0
-
setUrlNormalizers
public void setUrlNormalizers(List<IURLNormalizer> urlNormalizers)
Sets URL normalizers.- Parameters:
urlNormalizers
- URL normalizers- Since:
- 3.1.0
-
getDelayResolver
public IDelayResolver getDelayResolver()
-
setDelayResolver
public void setDelayResolver(IDelayResolver delayResolver)
-
getPreImportProcessors
public List<IHttpDocumentProcessor> getPreImportProcessors()
Gets pre-import processors.- Returns:
- pre-import processors
-
setPreImportProcessors
public void setPreImportProcessors(IHttpDocumentProcessor... preImportProcessors)
Sets pre-import processors.- Parameters:
preImportProcessors
- pre-import processors
-
setPreImportProcessors
public void setPreImportProcessors(List<IHttpDocumentProcessor> preImportProcessors)
Sets pre-import processors.- Parameters:
preImportProcessors
- pre-import processors- Since:
- 3.0.0
-
getPostImportProcessors
public List<IHttpDocumentProcessor> getPostImportProcessors()
Gets post-import processors.- Returns:
- post-import processors
-
setPostImportProcessors
public void setPostImportProcessors(IHttpDocumentProcessor... postImportProcessors)
Sets post-import processors.- Parameters:
postImportProcessors
- post-import processors
-
setPostImportProcessors
public void setPostImportProcessors(List<IHttpDocumentProcessor> postImportProcessors)
Sets post-import processors.- Parameters:
postImportProcessors
- post-import processors- Since:
- 3.0.0
-
isIgnoreRobotsTxt
public boolean isIgnoreRobotsTxt()
-
setIgnoreRobotsTxt
public void setIgnoreRobotsTxt(boolean ignoreRobotsTxt)
-
isKeepDownloads
public boolean isKeepDownloads()
-
setKeepDownloads
public void setKeepDownloads(boolean keepDownloads)
-
isKeepOutOfScopeLinks
@Deprecated public boolean isKeepOutOfScopeLinks()
Deprecated. Since 3.0.0, use getKeepReferencedLinks().
Whether links not in scope should be stored as metadata under HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE.
- Returns:
- true if keeping URLs not in scope
- Since:
- 2.8.0
-
setKeepOutOfScopeLinks
@Deprecated public void setKeepOutOfScopeLinks(boolean keepOutOfScopeLinks)
Deprecated. Since 3.0.0, use setKeepReferencedLinks(Set).
Sets whether links not in scope should be stored as metadata under HttpDocMetadata.REFERENCED_URLS_OUT_OF_SCOPE.
- Parameters:
- keepOutOfScopeLinks - true if keeping URLs not in scope
- Since:
- 2.8.0
-
getKeepReferencedLinks
public Set<HttpCrawlerConfig.ReferencedLinkType> getKeepReferencedLinks()
Gets what type of referenced links to keep, if any. Those links are URLs extracted by link extractors. See class documentation for more details.- Returns:
- preferences for keeping links
- Since:
- 3.0.0
-
setKeepReferencedLinks
public void setKeepReferencedLinks(Set<HttpCrawlerConfig.ReferencedLinkType> keepReferencedLinks)
Sets whether to keep referenced links and what to keep. Those links are URLs extracted by link extractors. See class documentation for more details.- Parameters:
keepReferencedLinks
- option for keeping links- Since:
- 3.0.0
-
setKeepReferencedLinks
public void setKeepReferencedLinks(HttpCrawlerConfig.ReferencedLinkType... keepReferencedLinks)
Sets whether to keep referenced links and what to keep. Those links are URLs extracted by link extractors. See class documentation for more details.- Parameters:
keepReferencedLinks
- option for keeping links- Since:
- 3.0.0
-
isIgnoreRobotsMeta
public boolean isIgnoreRobotsMeta()
-
setIgnoreRobotsMeta
public void setIgnoreRobotsMeta(boolean ignoreRobotsMeta)
-
getRobotsMetaProvider
public IRobotsMetaProvider getRobotsMetaProvider()
-
setRobotsMetaProvider
public void setRobotsMetaProvider(IRobotsMetaProvider robotsMetaProvider)
-
isIgnoreSitemap
public boolean isIgnoreSitemap()
Whether to ignore sitemap detection and resolving for URLs processed. Sitemaps specified as start URLs (getStartSitemapURLs()) are never ignored.
- Returns:
- true to ignore sitemaps
-
setIgnoreSitemap
public void setIgnoreSitemap(boolean ignoreSitemap)
Sets whether to ignore sitemap detection and resolving for URLs processed. Sitemaps specified as start URLs (getStartSitemapURLs()) are never ignored.
- Parameters:
- ignoreSitemap - true to ignore sitemaps
-
getSitemapResolver
public ISitemapResolver getSitemapResolver()
-
setSitemapResolver
public void setSitemapResolver(ISitemapResolver sitemapResolver)
-
isIgnoreCanonicalLinks
public boolean isIgnoreCanonicalLinks()
Whether canonical links found in HTTP headers and in the <head> section of HTML files should be ignored or processed. When processed (default), pages with a canonical URL pointer in them are not processed.
- Returns:
- true if ignoring canonical links
- Since:
- 2.2.0
-
setIgnoreCanonicalLinks
public void setIgnoreCanonicalLinks(boolean ignoreCanonicalLinks)
Sets whether canonical links found in HTTP headers and in the <head> section of HTML files should be ignored or processed. If true, pages with a canonical URL pointer in them are not rejected.
- Parameters:
- ignoreCanonicalLinks - true if ignoring canonical links
- Since:
- 2.2.0
-
getURLCrawlScopeStrategy
public URLCrawlScopeStrategy getURLCrawlScopeStrategy()
Gets the strategy to use to determine if a URL is in scope.- Returns:
- the strategy
-
setUrlCrawlScopeStrategy
public void setUrlCrawlScopeStrategy(URLCrawlScopeStrategy urlCrawlScopeStrategy)
Sets the strategy to use to determine if a URL is in scope.- Parameters:
urlCrawlScopeStrategy
- strategy to use- Since:
- 2.8.1
-
getRecrawlableResolver
public IRecrawlableResolver getRecrawlableResolver()
Gets the recrawlable resolver.- Returns:
- recrawlable resolver
- Since:
- 2.5.0
-
setRecrawlableResolver
public void setRecrawlableResolver(IRecrawlableResolver recrawlableResolver)
Sets the recrawlable resolver.- Parameters:
recrawlableResolver
- the recrawlable resolver- Since:
- 2.5.0
-
getPostImportLinks
public TextMatcher getPostImportLinks()
Gets a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.- Returns:
- field matcher
- Since:
- 3.0.0
-
setPostImportLinks
public void setPostImportLinks(TextMatcher fieldMatcher)
Set a field matcher used to identify post-import metadata fields holding URLs to consider for crawling.- Parameters:
fieldMatcher
- field matcher- Since:
- 3.0.0
-
isPostImportLinksKeep
public boolean isPostImportLinksKeep()
Gets whether to keep the importer-generated field holding URLs to consider for crawling.- Returns:
true
if keeping- Since:
- 3.0.0
-
setPostImportLinksKeep
public void setPostImportLinksKeep(boolean postImportLinksKeep)
Sets whether to keep the importer-generated field holding URLs to consider for crawling.- Parameters:
postImportLinksKeep
-true
if keeping- Since:
- 3.0.0
-
saveCrawlerConfigToXML
protected void saveCrawlerConfigToXML(XML xml)
- Specified by:
saveCrawlerConfigToXML
in classCrawlerConfig
-
loadCrawlerConfigFromXML
protected void loadCrawlerConfigFromXML(XML xml)
- Specified by:
loadCrawlerConfigFromXML
in classCrawlerConfig
-
equals
public boolean equals(Object other)
- Overrides:
equals
in classCrawlerConfig
-
hashCode
public int hashCode()
- Overrides:
hashCode
in classCrawlerConfig
-
toString
public String toString()
- Overrides:
toString
in classCrawlerConfig
-