Norconex Web Crawler

Configuration

Web Crawler Configuration Options

If you have not done so already, it is recommended you first familiarize yourself with the general XML configuration guidelines described in the online Norconex Web Crawler Manual.

The following are the available XML configuration options. Click on an expandable tag to get relevant documentation and more configuration options.

<httpcollector
id="..."
Required
You must give your configuration a unique identifier value.
>
Collector working directory. Can be an absolute path or a path relative to the process's current directory. This is where files downloaded or created as part of crawling activities get stored. Sub-directories matching the collector and crawler ids will be created automatically.
Default:
./work
Directory where generated files are written.
Default:
<workDir>/temp
<listener
class="..."
Required
>
Repeatable
One or more optional listeners to be notified on collector events (e.g. start, finish, error, etc.).
Interface:
</listener>
Maximum number of crawlers to run at once. Only useful when you have multiple crawlers defined. Default runs all crawlers simultaneously.
Default:
-1
Millisecond interval between each crawler's start. Only applicable when configuring multiple crawlers. Default has no delay between crawler starts.
Default:
0
Maximum number of bytes used for memory caching of data for all documents currently being processed by the Collector.
Default:
1 GB
Maximum number of bytes used for memory caching of data for a single document being processed by the Collector.
Default:
100 MB
All <crawler> configuration options (except for the crawler "id") can be set here as defaults to prevent repeating them. Settings defined here will be inherited by all individual crawlers defined further down, unless overwritten. Configuration blocks defined for a specific crawler always take precedence. If you overwrite a top-level crawler tag from the crawler defaults, all of that tag's default configuration settings will be replaced (no attempt will be made to merge or append). As an alternative to using this, you can dynamically include configuration fragments.
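Putting the collector-level options together, a minimal configuration might look like the sketch below. The <crawlers> wrapper around individual <crawler> elements and the sample values are assumptions based on the structure described above; verify them against your version.

  <httpcollector id="My Collector">
    <!-- Where downloaded and generated files get stored (default: ./work). -->
    <workDir>./work</workDir>
    <!-- Defaults inherited by every crawler unless overwritten. -->
    <crawlerDefaults>
      <numThreads>2</numThreads>
    </crawlerDefaults>
    <crawlers>
      <crawler id="My Crawler">
        <!-- Crawler-specific settings go here. -->
      </crawler>
    </crawlers>
  </httpcollector>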
<crawler
id="..."
Required
Each crawler must have an "id" attribute that uniquely identifies it.
>
Repeatable
Individual crawlers are defined here. All <crawlerDefaults> configuration settings will apply to all crawlers created unless explicitly overwritten in crawler configuration. For configuration options where multiple items can be present (e.g. filters), the entire list set in <crawlerDefaults> would be overwritten.
Normalize encountered URLs so equivalent URLs are only processed once.
Interface:
Minimum amount of time that must pass between each page download. Default is 3 seconds. Be nice!
Interface:
How many threads you want a crawler to use. Regardless of how many threads you have running, the URL download frequency remains dictated by the <delay/> option. Using more than one thread is a good idea to ensure the delay is respected when some downloads take more time than the configured delay.
Default:
2
How many levels deep the crawler can go. This is equivalent to how many user "clicks" away from the main page (start URL) each page can be in order to be considered. Beyond the specified depth, pages are rejected. Start URLs all have a depth of zero.
Default:
-1 (unlimited)
Stop crawling after that many references were processed. A processed reference is one that was read from the crawler queue in an attempt to fetch its corresponding document, whether that attempt was successful or not. For additional control on stopping the crawler after a certain number of events, have a look at com.norconex.collector.core.crawler.event.impl.StopCrawlerOnMaxEventListener
Default:
-1 (unlimited)
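For illustration, the crawl throttling and limit options above could be combined as in the following sketch. The values are arbitrary, and the attribute form of <delay> shown here assumes the default delay resolver; check your version's documentation for alternatives.

  <delay default="3000" />           <!-- Wait at least 3 seconds between downloads. -->
  <numThreads>4</numThreads>         <!-- The delay above is still respected. -->
  <maxDepth>5</maxDepth>             <!-- Reject pages more than 5 "clicks" from a start URL. -->
  <maxDocuments>10000</maxDocuments> <!-- Stop after 10000 processed references (-1 = unlimited). -->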
Dictates how the crawler should handle HTTP requests using the HEAD method. Enabling this option allows obtaining the HTTP response headers first, without downloading the document. This can be useful when relying on the obtained metadata to filter documents (saving unnecessary downloads).
Possible values:
DISABLED
No HTTP call will be made using that method.
OPTIONAL
If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document can still be processed successfully by the other HTTP method. Only relevant when both HEAD and GET are enabled.
REQUIRED
If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document will be rejected and won't go any further, even if the other HTTP method was or could have been successful. Only relevant when both HEAD and GET are enabled.
Default:
DISABLED
Dictates how the crawler should handle HTTP requests using the GET method.
Default:
REQUIRED
More documentation:
Shares the same configuration options as <fetchHttpHead>.
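For example, to obtain HTTP headers before downloading documents, the two directives could be combined as in this sketch (the <fetchHttpGet> element name is inferred from the <fetchHttpHead> naming above):

  <!-- Issue a HEAD request first so header-based filters can reject
       documents before their content is downloaded. -->
  <fetchHttpHead>OPTIONAL</fetchHttpHead>
  <!-- GET remains required to obtain the actual document content. -->
  <fetchHttpGet>REQUIRED</fetchHttpGet>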
Keep downloaded files on local disk under <workDir>/downloads.
Default:
false
Action to perform on valid documents which, on subsequent crawls, can no longer be reached (e.g., there are no longer any links pointing to that page).
Possible values:
PROCESS
Attempts to download and process orphan documents, as if they were encountered normally.
DELETE
Sends a deletion request for orphan documents.
IGNORE
Ignores orphan documents (do nothing).
Default:
PROCESS
<exception>
Repeatable
A fully qualified name of a Java exception that should force a crawler to stop when triggered during the processing of a document. Leave blank to have the crawler attempt to continue.
</exception>
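Put together, these housekeeping options might look like the following sketch. The <keepDownloads>, <orphansStrategy>, and <stopOnExceptions> element names are assumptions derived from the descriptions above.

  <keepDownloads>false</keepDownloads>
  <!-- Send a deletion request for documents that can no longer be reached. -->
  <orphansStrategy>DELETE</orphansStrategy>
  <stopOnExceptions>
    <!-- Stop the crawler outright when it runs out of memory. -->
    <exception>java.lang.OutOfMemoryError</exception>
  </stopOnExceptions>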
<listener
class="..."
Required
>
Repeatable
One or more optional listeners to be notified on crawler events (e.g. document rejected, document imported, etc.).
Interface:
</listener>
<fetcher
class="..."
Required
>
Repeatable
Responsible for making HTTP requests and fetching associated content. Fetchers are defined in execution order. If the first fails or does not support a given URL, the next fetcher will try to fetch it, and so on.
Default:
GenericHttpFetcher
Interface:
</fetcher>
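A typical fetcher declaration might look like the sketch below. The fully qualified class name is where GenericHttpFetcher is usually found in version 3, and the <userAgent> option is given as an assumption; both should be verified against your installation.

  <fetcher class="com.norconex.collector.http.fetch.impl.GenericHttpFetcher">
    <!-- Identify your crawler to web servers (assumed option). -->
    <userAgent>MyOrganizationCrawler</userAgent>
  </fetcher>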
<filter
class="..."
Required
onMatch="..."
Possible values:
EXCLUDE
Excludes references matching any of the "EXCLUDE" filters. Takes precedence over "INCLUDE".
INCLUDE
Includes references matching any of the "INCLUDE" filters.
Default:
EXCLUDE
>
Repeatable
Filters URLs BEFORE any download.
Interface:
</filter>
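As an example, references could be excluded by file extension before any download. The <referenceFilters> wrapper, the class name, and the way extensions are supplied are assumptions based on the core collector's commonly documented ExtensionReferenceFilter; adapt them to your version.

  <referenceFilters>
    <!-- Skip binary formats we do not want to download. -->
    <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
        onMatch="EXCLUDE">
      jpg,png,gif,zip
    </filter>
  </referenceFilters>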
Loads sitemap.xml URLs and adds them to the queue of URLs to be processed.
Interface:
<filter
class="..."
Required
onMatch="..."
Possible values:
EXCLUDE
Excludes references matching any of the "EXCLUDE" filters. Takes precedence over "INCLUDE".
INCLUDE
Includes references matching any of the "INCLUDE" filters.
Default:
EXCLUDE
>
Repeatable
Filters URLs AFTER download of HTTP headers.
Interface:
</filter>
Generates a checksum value from document headers to find out if a document has changed since the previous crawl.
Relies on the uniqueness of a document metadata checksum to detect duplicates within a crawling session and reject them. It will reject any document that has the same metadata checksum as a previously processed document.
Default:
false
Establishes whether to follow a page's URLs or to index a given page, based on in-page robots meta tag information.
Interface:
<extractor
class="..."
Required
>
Repeatable
Extracts links from a document.
Default:
HtmlLinkExtractor
Interface:
</extractor>
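A link extractor declaration typically looks like the sketch below. HtmlLinkExtractor is the documented default; its fully qualified class name and the <linkExtractors> wrapper are given as assumptions to verify against your version.

  <linkExtractors>
    <!-- Extract links from HTML documents using the default extractor. -->
    <extractor class="com.norconex.collector.http.link.impl.HtmlLinkExtractor"/>
  </linkExtractors>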
<filter
class="..."
Required
onMatch="..."
Possible values:
EXCLUDE
Excludes references matching any of the "EXCLUDE" filters. Takes precedence over "INCLUDE".
INCLUDE
Includes references matching any of the "INCLUDE" filters.
Default:
EXCLUDE
>
Repeatable
Filters documents AFTER links were extracted.
Interface:
</filter>
<processor
class="..."
Required
>
Repeatable
Process a document just BEFORE importing it.
</processor>
<importer>
The Importer is responsible for extracting raw text out of documents, in addition to transforming, decorating, and filtering content.
Directory where files generating parsing errors are saved.
Default:
None (not saved)
One or a series of <handler> elements applied to imported documents in their original format BEFORE their parsing has occurred. Can be mixed with XML-based condition wrappers to create a processing "flow" (if, ifNot).
Repeatable
Used to conditionally execute one or more <handler>s if a condition (or group of conditions) returns true. Must contain exactly one of <conditions> or <condition> as a direct child element, followed by exactly one <then>, and optionally one <else>.
Used to group multiple <condition> or <conditions> together.
Repeatable
More documentation:
Refer to previously documented <condition> (under <if>) for available options.
Repeatable
More documentation:
Refer to previously documented <conditions> (under <if>) for available options.
Wrapper around handlers executed when the condition is met. Can also contain nested <if> and <ifNot>.
Repeatable
More documentation:
Refer to previously documented <handler> (under <preParseHandlers>) for available options.
Repeatable
More documentation:
Refer to previously documented <if> (under <preParseHandlers>) for available options.
Repeatable
More documentation:
Refer to previously documented <ifNot> (under <preParseHandlers>) for available options.
Wrapper around handlers executed when the condition is not met. Can also contain nested <if> and <ifNot>.
Repeatable
More documentation:
Refer to previously documented <handler> (under <preParseHandlers>) for available options.
Repeatable
More documentation:
Refer to previously documented <if> (under <preParseHandlers>) for available options.
Repeatable
More documentation:
Refer to previously documented <ifNot> (under <preParseHandlers>) for available options.
Repeatable
Used to conditionally execute one or more <handler>s if a condition (or group of conditions) returns false.
More documentation:
Supports the same options as <if>. Refer to previously documented <if> (under <preParseHandlers>) for available options.
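Structurally, a pre-parse flow combining handlers and conditions looks like the sketch below. The class names are placeholders only, standing in for whichever Importer handler and condition implementations you use.

  <preParseHandlers>
    <!-- Handler applied to every document before parsing. -->
    <handler class="fully.qualified.SomeHandler"/>
    <if>
      <condition class="fully.qualified.SomeCondition"/>
      <then>
        <!-- Executed only when the condition returns true. -->
        <handler class="fully.qualified.AnotherHandler"/>
      </then>
      <else>
        <!-- Executed only when the condition returns false. -->
        <handler class="fully.qualified.FallbackHandler"/>
      </else>
    </if>
  </preParseHandlers>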
Factory to select and configure document parsers to use for each content type encountered.
Default:
GenericDocumentParserFactory
One or a series of <handler> elements applied to imported documents AFTER their parsing has occurred and their raw text extracted.
More documentation:
Supports the same options as <preParseHandlers>. Refer to previously documented <preParseHandlers> (under <importer>) for available options.
<responseProcessor
class="..."
Required
>
Repeatable
One or more optional custom classes that process an Importer response to modify it or perform other actions as required before it is returned.
</responseProcessor>
</importer>
Generates a checksum value from a document to find out if it has changed since the previous crawl. Invoked right AFTER the document was imported.
Relies on the uniqueness of a document checksum to detect duplicates within a crawling session and reject them. It will reject any document that has the same checksum as a previously processed document.
Default:
false
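As a sketch, the document checksum and duplicate rejection could be configured as follows. MD5DocumentChecksummer is a common core implementation, but its fully qualified name and the <documentDeduplicate> element name are assumptions to confirm against your version.

  <!-- Checksum used to detect whether a document changed since the last crawl. -->
  <documentChecksummer
      class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer"/>
  <!-- Reject documents whose checksum matches one already processed this session. -->
  <documentDeduplicate>true</documentDeduplicate>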
<processor
class="..."
Required
>
Repeatable
Process a document AFTER it has been imported and its document checksum established.
</processor>
<committer
class="..."
Required
>
Repeatable
Committers persist imported documents to any target data source. While not strictly required, having at least one Committer is typically essential to get anything out of your crawling.
Interface:
More documentation:
Several more classes are available for all kinds of repositories (databases, search engines, etc.). While these "add-ons" are also open-source, they have to be installed separately. You can get a full list of those on the Committers page.
</committer>
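For instance, a simple file-system committer, handy for inspecting what would be sent to a repository, could be declared as below. The class name and <directory> option come from the core Committer distribution as an assumption; dedicated committers (search engines, databases, etc.) are installed separately as noted above.

  <committer class="com.norconex.committer.core3.fs.impl.XMLFileCommitter">
    <!-- Where committed documents are written as XML files. -->
    <directory>./committed</directory>
  </committer>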
</crawler>
</httpcollector>