Norconex Web Crawler

Configuration

Web Crawler Configuration Options

If you have not done so already, it is recommended you first familiarize yourself with the general XML configuration guidelines described in the online Norconex Web Crawler Manual.

The following are the available XML configuration options. Click on an expandable tag to get relevant documentation and more configuration options.

<httpcollector
id="..."
Required
You must give your configuration a unique identifier value.
>
Collector working directory. Can be an absolute path or a path relative to the process's current directory. This is where files downloaded or created as part of crawling activities get stored. Sub-directories matching the collector and crawler ids will be created automatically.
Default:
./work
Directory where generated files are written.
Default:
<workDir>/temp
<listener
class="..."
Required
>
Repeatable
One or more optional listeners to be notified on collector events (e.g. start, finish, error, etc.).
Interface:
</listener>
Maximum number of crawlers to run at once. Only useful when you have multiple crawlers defined. Default runs all crawlers simultaneously.
Default:
-1
Millisecond interval between each crawler's start. Only applicable when configuring multiple crawlers. Default has no delay between crawler starts.
Default:
0
Maximum number of bytes used for memory caching of data for all documents currently being processed by the Collector.
Default:
1 GB
Maximum number of bytes used for memory caching of data for a single document being processed by the Collector.
Default:
100 MB
All <crawler> configuration options (except for the crawler "id") can be set here as defaults to prevent repeating them. Settings defined here will be inherited by all individual crawlers defined further down, unless overwritten. Configuration blocks defined for a specific crawler always take precedence. If you overwrite a top-level crawler tag from the crawler defaults, all of that tag's default configuration settings will be replaced (no attempt will be made to merge or append). As an alternative to using this, you can dynamically include configuration fragments.
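Putting the collector-level options together, a minimal configuration might look like the sketch below. The <crawlers> wrapper around individual <crawler> elements and the sample values are assumptions based on the structure described above; verify them against your version.

  <httpcollector id="My Collector">
    <!-- Where downloaded and generated files get stored (default: ./work). -->
    <workDir>./work</workDir>
    <!-- Defaults inherited by every crawler unless overwritten. -->
    <crawlerDefaults>
      <numThreads>2</numThreads>
    </crawlerDefaults>
    <crawlers>
      <crawler id="My Crawler">
        <!-- Crawler-specific settings go here. -->
      </crawler>
    </crawlers>
  </httpcollector>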
<crawler
id="..."
Required
Each crawler must have an "id" attribute that uniquely identifies it.
>
Repeatable
Individual crawlers are defined here. All <crawlerDefaults> configuration settings will apply to all crawlers created unless explicitly overwritten in crawler configuration. For configuration options where multiple items can be present (e.g. filters), the entire list set in <crawlerDefaults> would be overwritten.
Normalize encountered URLs so equivalent URLs are only processed once.
Interface:
Minimum amount of time that must pass between each page download. Default is 3 seconds. Be nice!
Interface:
How many threads you want a crawler to use. Regardless of how many threads you have running, the URL download frequency remains dictated by the <delay/> option. Using more than one thread is a good idea to ensure the delay is respected when some downloads take more time than the configured delay.
Default:
2
How many levels deep the crawler can go. This is equivalent to how many user "clicks" away from the main page (start URL) each page can be in order to be considered. Beyond the specified depth, pages are rejected. Start URLs all have a depth of zero.
Default:
-1 (unlimited)
Stop crawling after that many references were processed. A processed reference is one that was read from the crawler queue in an attempt to fetch its corresponding document, whether that attempt was successful or not. For additional control on stopping the crawler after a certain number of events, have a look at com.norconex.collector.core.crawler.event.impl.StopCrawlerOnMaxEventListener
Default:
-1 (unlimited)
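For illustration, the crawl throttling and limit options above could be combined as in the following sketch. The values are arbitrary, and the attribute form of <delay> shown here assumes the default delay resolver; check your version's documentation for alternatives.

  <delay default="3000" />           <!-- Wait at least 3 seconds between downloads. -->
  <numThreads>4</numThreads>         <!-- The delay above is still respected. -->
  <maxDepth>5</maxDepth>             <!-- Reject pages more than 5 "clicks" from a start URL. -->
  <maxDocuments>10000</maxDocuments> <!-- Stop after 10000 processed references (-1 = unlimited). -->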
Dictates how the crawler should handle HTTP requests using the HEAD method. Enabling this option allows obtaining the HTTP response headers first, without downloading the document. This can be useful when relying on the obtained metadata to filter documents (saving unnecessary downloads).
Possible values:
DISABLED
No HTTP call will be made using that method.
OPTIONAL
If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document can still be processed successfully by the other HTTP method. Only relevant when both HEAD and GET are enabled.
REQUIRED
If the HTTP method is not supported by any fetcher or the HTTP request for it was not successful, the document will be rejected and won't go any further, even if the other HTTP method was or could have been successful. Only relevant when both HEAD and GET are enabled.
Default:
DISABLED
Dictates how the crawler should handle HTTP requests using the GET method.
Default:
REQUIRED
More documentation:
Shares the same configuration options as <fetchHttpHead>.
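For example, to obtain HTTP headers before downloading documents, the two directives could be combined as in this sketch (the <fetchHttpGet> element name is inferred from the <fetchHttpHead> naming above):

  <!-- Issue a HEAD request first so header-based filters can reject
       documents before their content is downloaded. -->
  <fetchHttpHead>OPTIONAL</fetchHttpHead>
  <!-- GET remains required to obtain the actual document content. -->
  <fetchHttpGet>REQUIRED</fetchHttpGet>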
Keep downloaded files on local disk under <workDir>/downloads.
Default:
false
Action to perform on valid documents which, on subsequent crawls, can no longer be reached (e.g., there are no longer any links pointing to that page).
Possible values:
PROCESS
Attempts to download and process orphan documents, as if they were encountered normally.
DELETE
Sends a deletion request for orphan documents.
IGNORE
Ignores orphan documents (do nothing).
Default:
PROCESS
<exception>
Repeatable
A fully qualified name of a Java exception that should force a crawler to stop when triggered during the processing of a document. Leave blank to have the crawler attempt to continue.
</exception>
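Put together, these housekeeping options might look like the following sketch. The <keepDownloads>, <orphansStrategy>, and <stopOnExceptions> element names are assumptions derived from the descriptions above.

  <keepDownloads>false</keepDownloads>
  <!-- Send a deletion request for documents that can no longer be reached. -->
  <orphansStrategy>DELETE</orphansStrategy>
  <stopOnExceptions>
    <!-- Stop the crawler outright when it runs out of memory. -->
    <exception>java.lang.OutOfMemoryError</exception>
  </stopOnExceptions>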
<listener
class="..."
Required
>
Repeatable
One or more optional listeners to be notified on crawler events (e.g. document rejected, document imported, etc.).
Interface:
</listener>
<fetcher
class="..."
Required
>
Repeatable
Responsible for making HTTP requests and fetching associated content. Fetchers are defined in execution order. If the first fails or does not support a given URL, the next fetcher will try to fetch it, and so on.
Default:
GenericHttpFetcher
Interface:
</fetcher>
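A typical fetcher declaration might look like the sketch below. The fully qualified class name is where GenericHttpFetcher is usually found in version 3, and the <userAgent> option is given as an assumption; both should be verified against your installation.

  <fetcher class="com.norconex.collector.http.fetch.impl.GenericHttpFetcher">
    <!-- Identify your crawler to web servers (assumed option). -->
    <userAgent>MyOrganizationCrawler</userAgent>
  </fetcher>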
<filter
class="..."
Required
onMatch="..."
Possible values:
EXCLUDE
Excludes references matching any of the "EXCLUDE" filters. Takes precedence over "INCLUDE".
INCLUDE
Includes references matching any of the "INCLUDE" filters.
Default:
EXCLUDE
>
Repeatable
Filters URLs BEFORE any download.
Interface:
</filter>
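As an example, references could be excluded by file extension before any download. The <referenceFilters> wrapper, the class name, and the way extensions are supplied are assumptions based on the core collector's commonly documented ExtensionReferenceFilter; adapt them to your version.

  <referenceFilters>
    <!-- Skip binary formats we do not want to download. -->
    <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
        onMatch="EXCLUDE">
      jpg,png,gif,zip
    </filter>
  </referenceFilters>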
Loads sitemap.xml URLs and adds them to the queue of URLs to be processed.
Interface:
<filter
class="..."
Required
onMatch="..."
Possible values:
EXCLUDE
Excludes references matching any of the "EXCLUDE" filters. Takes precedence over "INCLUDE".
INCLUDE
Includes references matching any of the "INCLUDE" filters.
Default:
EXCLUDE
>
Repeatable
Filters URLs AFTER download of HTTP headers.
Interface:
</filter>
Generates a checksum value from document headers to find out if a document has changed since the previous crawl.
Relies on the uniqueness of a document metadata checksum to detect duplicates within a crawling session and reject them. It will reject any document that has the same metadata checksum as a previously processed document.
Default:
false
Establishes whether to follow a page's URLs or to index a given page, based on in-page robots meta tag information.
Interface:
<extractor
class="..."
Required
>
Repeatable
Extracts links from a document.
Default:
HtmlLinkExtractor
Interface:
</extractor>
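A link extractor declaration typically looks like the sketch below. HtmlLinkExtractor is the documented default; its fully qualified class name and the <linkExtractors> wrapper are given as assumptions to verify against your version.

  <linkExtractors>
    <!-- Extract links from HTML documents using the default extractor. -->
    <extractor class="com.norconex.collector.http.link.impl.HtmlLinkExtractor"/>
  </linkExtractors>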
<filter
class="..."
Required
onMatch="..."
Possible values:
EXCLUDE
Excludes references matching any of the "EXCLUDE" filters. Takes precedence over "INCLUDE".
INCLUDE
Includes references matching any of the "INCLUDE" filters.
Default:
EXCLUDE
>
Repeatable
Filters documents AFTER links were extracted.
Interface:
</filter>
<processor
class="..."
Required
>
Repeatable
Process a document just BEFORE importing it.
</processor>
<importer>
The Importer is responsible for extracting raw text out of documents, in addition to transforming, decorating, and filtering content.
Directory where files generating parsing errors are saved.
Default:
None (not saved)
One or a series of <handler> elements applied to imported documents in their original format BEFORE their parsing has occurred. Can be mixed with XML-based condition wrappers to create a processing "flow" (if, ifNot).
Repeatable
Used to conditionally execute one or more <handler>s if a condition (or group of conditions) returns true. Must contain exactly one of <conditions> or <condition> as a direct child element, followed by exactly one <then>, and optionally one <else>.
Used to group multiple <condition> or <conditions> together.
Repeatable
More documentation:
Refer to previously documented <condition> (under <if>) for available options.
Repeatable
More documentation:
Refer to previously documented <conditions> (under <if>) for available options.
Wrapper around handlers executed when the condition is met. Can also contain nested <if> and <ifNot>.
Repeatable
More documentation:
Refer to previously documented <handler> (under <preParseHandlers>) for available options.
Repeatable
More documentation:
Refer to previously documented <if> (under <preParseHandlers>) for available options.
Repeatable
More documentation:
Refer to previously documented <ifNot> (under <preParseHandlers>) for available options.
Wrapper around handlers executed when the condition is not met. Can also contain nested <if> and <ifNot>.
Repeatable
More documentation:
Refer to previously documented <handler> (under <preParseHandlers>) for available options.
Repeatable
More documentation:
Refer to previously documented <if> (under <preParseHandlers>) for available options.
Repeatable
More documentation:
Refer to previously documented <ifNot> (under <preParseHandlers>) for available options.
Repeatable
Used to conditionally execute one or more <handler>s if a condition (or group of conditions) returns false.
More documentation:
Supports the same options as <if>. Refer to previously documented <if> (under <preParseHandlers>) for available options.
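Structurally, a pre-parse flow combining handlers and conditions looks like the sketch below. The class names are placeholders only, standing in for whichever Importer handler and condition implementations you use.

  <preParseHandlers>
    <!-- Handler applied to every document before parsing. -->
    <handler class="fully.qualified.SomeHandler"/>
    <if>
      <condition class="fully.qualified.SomeCondition"/>
      <then>
        <!-- Executed only when the condition returns true. -->
        <handler class="fully.qualified.AnotherHandler"/>
      </then>
      <else>
        <!-- Executed only when the condition returns false. -->
        <handler class="fully.qualified.FallbackHandler"/>
      </else>
    </if>
  </preParseHandlers>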
Factory to select and configure document parsers to use for each content type encountered.
Default:
GenericDocumentParserFactory
One or a series of <handler> elements applied to imported documents AFTER their parsing has occurred and their raw text extracted.
More documentation:
Supports the same options as <preParseHandlers>. Refer to previously documented <preParseHandlers> (under <importer>) for available options.
<responseProcessor
class="..."
Required
>
Repeatable
One or more optional custom classes that process an Importer response to modify it or perform other actions as required before it is returned.
</responseProcessor>
</importer>
Generates a checksum value from a document to find out if it has changed since the previous crawl. Invoked right AFTER the document was imported.
Relies on the uniqueness of a document checksum to detect duplicates within a crawling session and reject them. It will reject any document that has the same checksum as a previously processed document.
Default:
false
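As a sketch, the document checksum and duplicate rejection could be configured as follows. MD5DocumentChecksummer is a common core implementation, but its fully qualified name and the <documentDeduplicate> element name are assumptions to confirm against your version.

  <!-- Checksum used to detect whether a document changed since the last crawl. -->
  <documentChecksummer
      class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer"/>
  <!-- Reject documents whose checksum matches one already processed this session. -->
  <documentDeduplicate>true</documentDeduplicate>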
<processor
class="..."
Required
>
Repeatable
Process a document AFTER it has been imported and its document checksum established.
</processor>
<committer
class="..."
Required
>
Repeatable
Committers persist imported documents to any target data source. While not strictly required, having at least one Committer is typically essential to get anything out of your crawling.
Interface:
More documentation:
Several more classes are available for all kinds of repositories (databases, search engines, etc.). While these "add-ons" are also open-source, they have to be installed separately. You can get a full list of those on the Committers page.
</committer>
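For instance, a simple file-system committer, handy for inspecting what would be sent to a repository, could be declared as below. The class name and <directory> option come from the core Committer distribution as an assumption; dedicated committers (search engines, databases, etc.) are installed separately as noted above.

  <committer class="com.norconex.committer.core3.fs.impl.XMLFileCommitter">
    <!-- Where committed documents are written as XML files. -->
    <directory>./committed</directory>
  </committer>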
</crawler>
</httpcollector>