If you have not done so already, it is recommended you first familiarize yourself with the general XML configuration guidelines described in the online Norconex Web Crawler Manual.
The following are the available XML configuration options.
<crawler> configuration options (except for the crawler "id") can be set here as defaults to avoid repeating them. Settings defined here are inherited by all individual crawlers defined further down, unless overwritten. Configuration blocks defined for a specific crawler always take precedence. If you overwrite a top-level crawler tag from the crawler defaults, all the default tag configuration settings are replaced (no attempt is made to merge or append). As an alternative, you can dynamically include configuration fragments.
<crawlerDefaults> configuration settings apply to all crawlers created unless explicitly overwritten in a crawler configuration. For configuration options where multiple items can be present (e.g. filters), the entire list set in <crawlerDefaults> would be overwritten.
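The inheritance rule can be pictured with a small fragment. This is only a sketch under stated assumptions: the <httpcollector> and <crawlers> wrapper elements, the ids, and the values are hypothetical choices made for illustration, and the crawlers are left without start URLs for brevity. Only <crawlerDefaults>, <crawler>, the crawler "id", and <maxDepth> are named in this reference.

```xml
<!-- Sketch only: wrapper elements, ids, and values are assumptions. -->
<httpcollector id="example-collector">
  <crawlerDefaults>
    <!-- Inherited by every crawler that does not define its own <maxDepth>. -->
    <maxDepth>5</maxDepth>
  </crawlerDefaults>
  <crawlers>
    <!-- No <maxDepth> here, so the default value of 5 would apply. -->
    <crawler id="crawler-a"/>
    <!-- Crawler-level block takes precedence and replaces the default. -->
    <crawler id="crawler-b">
      <maxDepth>2</maxDepth>
    </crawler>
  </crawlers>
</httpcollector>
```

The same replacement behavior is described above for multi-item options such as filters: a list defined in a crawler would replace the default list as a whole rather than being merged with it.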
are ignored.
<delay/> option. Using more than one thread is a good idea to ensure the delay is respected in case you run into downloads taking more time than the configured delay.
<fetchHttpHead>.
<workDir>/downloads.
<startURLs> condition attributes (e.g., stayOnDomain, stayOnPort, ...). Stored in collector.referenced-urls field.
<startURLs> condition attributes) in collector.referenced-urls-out-of-scope field.
<maxDepth>. Must be used with at least one other option to have any effect.
<handler>
elements applied to imported documents in their original format BEFORE their parsing has occurred. Can be mixed with XML-based condition wrappers to create a processing "flow" (if, ifNot).
<handler>s if a condition (or group of conditions) returns true. Must contain exactly one of <conditions> or <condition> as a direct child element, followed by exactly one <then>, and optionally one <else>.
<condition>
or <conditions> together.
<condition> (under <if>) for available options.
<conditions> (under <if>) for available options.
<if> and <ifNot>.
<handler> (under <preParseHandlers>) for available options.
<if> (under <preParseHandlers>) for available options.
<if> and <ifNot>.
<handler> (under <preParseHandlers>) for available options.
<if> (under <preParseHandlers>) for available options.
<handler> elements applied to imported documents AFTER their parsing has occurred and their raw text extracted.
<preParseHandlers>. Refer to previously documented <preParseHandlers> (under <importer>) for available options.
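The conditional "flow" described for importer handlers can be sketched roughly as follows. This is an illustrative fragment, not a working configuration: the class attribute and every com.example class name are hypothetical placeholders, and the <postParseHandlers> element name is an assumption made by analogy with <preParseHandlers>. The nesting follows the rule stated above: an <if> holds exactly one <condition> (or <conditions>), then exactly one <then>, and optionally one <else>.

```xml
<!-- Sketch only: class names are hypothetical placeholders. -->
<importer>
  <preParseHandlers>
    <!-- Applied unconditionally, before parsing. -->
    <handler class="com.example.FirstHandler"/>
    <if>
      <!-- Exactly one <condition> or <conditions> as a direct child. -->
      <condition class="com.example.SomeCondition"/>
      <then>
        <!-- Runs when the condition returns true. -->
        <handler class="com.example.HandlerWhenTrue"/>
      </then>
      <else>
        <!-- Optional: runs when the condition returns false. -->
        <handler class="com.example.HandlerWhenFalse"/>
      </else>
    </if>
  </preParseHandlers>
  <!-- Assumed element name for handlers applied after parsing. -->
  <postParseHandlers>
    <handler class="com.example.AfterParseHandler"/>
  </postParseHandlers>
</importer>
```

An <ifNot> block would presumably take the same children, with the <then> branch running when the condition returns false.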