Norconex Web Crawler

Configuration

Configuration Quick start

To get started quickly, download the latest version of Norconex Web Crawler and locate the file ./examples/HOWTO_RUN_EXAMPLES.txt. This file points you to functional configuration files and will have you running a very simple crawler in no time. See that file for the full instructions; a typical way of launching the examples is shown below for your convenience.
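
The exact script name, options, and paths can vary between releases, so treat the following as a rough sketch and rely on the HOWTO file shipped with your download. From the installation directory, launching the bundled "minimum" example typically looks like this:

Linux/Unix:
    collector-http.sh -a start -c examples/minimum/minimum-config.xml

Windows:
    collector-http.bat -a start -c examples\minimum\minimum-config.xml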

Web Crawler Configuration Options

To get the full potential of Norconex Web Crawler and learn which parts can easily be extended, refer to the following XML-based configuration. Entries with a "class" attribute expect an implementation of your choice. The Web Crawler API already offers several concrete implementations. Configuration options with a default value do not have to be defined. Developers can also create their own by implementing the proper Java interfaces. Refer to the Norconex Web Crawler JavaDoc and/or see further down which interfaces you can implement. Go to the Extend the Web Crawler page for more details on adding your own implementations.

Click on a tag name to jump to/from its documentation.

<httpcollector id="...">
 
    <progressDir>...</progressDir>
    <logsDir>...</logsDir>
    <collectorListeners>
        <listener class="..."/>
        ...
    </collectorListeners>
 
    <crawlerDefaults>
        <!-- All crawler options defined below (except for the crawler "id") 
             can be set here as defaults shared between multiple crawlers. 
             Configuration blocks defined for a specific crawler always take
             precedence. -->
    </crawlerDefaults>
 
    <crawlers>
        <!-- You need to define at least one crawler. -->
        <crawler id="...">
            <startURLs stayOnDomain="..." includeSubdomains="..." stayOnPort="..." stayOnProtocol="...">
                <url>...</url>
                <urlsFile>...</urlsFile>
                <sitemap>...</sitemap>
                <provider class="..."/>
                ...
            </startURLs>
            <userAgent>...</userAgent>
            <urlNormalizer class="..." />
            <delay class="..." />
            <numThreads>...</numThreads>
            <maxDepth>...</maxDepth>
            <maxDocuments>...</maxDocuments>
            <workDir>...</workDir>
            <keepDownloads>...</keepDownloads>
            <keepOutOfScopeLinks>...</keepOutOfScopeLinks>
            <orphansStrategy>...</orphansStrategy>
            <stopOnExceptions>
              <exception>...</exception>
              ...
            </stopOnExceptions>
            <crawlerListeners>
                <listener class="..."/>
                ...
            </crawlerListeners>
            <crawlDataStoreFactory class="..." />
            <httpClientFactory class="..." />
            <referenceFilters>
                <filter class="..." />
                ...
            </referenceFilters>
            <robotsTxt ignore="..." class="..."/>
            <sitemapResolverFactory ignore="..." class="...">
               ...
            </sitemapResolverFactory>
            <redirectURLProvider class="..." />
            <recrawlableResolver class="..." />
            <metadataFetcher class="..." />
            <metadataFilters>
                <filter class="..." />
                ...
            </metadataFilters>
            <canonicalLinkDetector ignore="..." class="..." />
            <metadataChecksummer class="..." />
            <documentFetcher class="..." />
            <robotsMeta ignore="..." class="..."/>
            <linkExtractors>
                <extractor class="..." />
            </linkExtractors>
            <documentFilters>
                <filter class="..." />
                ...
            </documentFilters>
            <preImportProcessors>
                <processor class="..." />
                ...
            </preImportProcessors>
 
            <importer>
                <!-- refer to Importer documentation -->
            </importer>
 
            <documentChecksummer class="..." />
 
            <postImportProcessors>
              <processor class="..."></processor>
            </postImportProcessors>
 
            <spoiledReferenceStrategizer class="..." />
 
            <committer class="...">
                <!-- refer to Committer documentation -->
            </committer>		
        </crawler>
        ...
    </crawlers>
 
</httpcollector>
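
To make the skeleton above more concrete, here is a minimal sketch of a working configuration. The identifiers, URL, directories, and the FileSystemCommitter class shown are illustrative assumptions; verify fully qualified class names against the Javadoc for your version.

<httpcollector id="Minimal Example Collector">

    <progressDir>./output/progress</progressDir>
    <logsDir>./output/logs</logsDir>

    <crawlers>
        <crawler id="Minimal Example Crawler">
            <!-- Hypothetical start URL; replace with a site you are allowed to crawl. -->
            <startURLs stayOnDomain="true">
                <url>https://www.example.com/</url>
            </startURLs>
            <workDir>./output/work</workDir>
            <maxDepth>2</maxDepth>
            <!-- Writes processed documents to local files instead of a search engine. -->
            <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
                <directory>./output/crawled</directory>
            </committer>
        </crawler>
    </crawlers>

</httpcollector>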

The table below lists the interface names that you can easily extend and the out-of-the-box implementations available for each. In the configuration file, you must use the fully qualified class name, as defined in the Javadoc. Click on a class or interface name to go directly to its full documentation, including extra configuration options. When a default implementation exists for a configuration option taking a class attribute, it is highlighted.

Tag Description Classes Interface
httpcollector Root tag; you must give your configuration a unique identifier value. N/A N/A
progressDir Directory where to store crawling progress files. Default is "./progress". N/A N/A
logsDir Directory where crawl logs will be stored. Default is "./logs". N/A N/A
collectorListeners Listen to collector events. ICollectorLifeCycleListener
startURLs URLs to start crawling from. Can be one or several of <url>, <urlsFile> (file containing URLs), <sitemap> (URL to a sitemap.xml file) or <provider> (for a dynamically generated list). N/A IStartURLsProvider (<provider> only)
userAgent The crawler "User-Agent" value to identify your crawler to sites you crawl. N/A N/A
urlNormalizer Normalizes incoming URLs. GenericURLNormalizer IURLNormalizer
delay Handles interval between each page download. GenericDelayResolver, ReferenceDelayResolver IDelayResolver
numThreads Number of execution threads for a crawler. Default is 2. N/A N/A
maxDepth How many levels deep to crawl from the start URL(s). Default is -1 (unlimited). N/A N/A
maxDocuments Maximum documents to successfully process. Default is -1 (unlimited). N/A N/A
workDir Where to store files created as part of crawling activities. Default is "./work". N/A N/A
keepDownloads Whether to keep downloaded files. Default is false. N/A N/A
keepOutOfScopeLinks Keep extracted links that are out-of-scope according to the start URL stayOn... flags. Default is false. N/A N/A
orphansStrategy What to do with URLs that are no longer referenced. PROCESS (default), IGNORE, or DELETE. N/A N/A
stopOnExceptions What exception(s) should force a crawler to stop when triggered during the processing of a document. N/A N/A
crawlerListeners Listen to crawling events. URLStatusCrawlerEventListener ICrawlerEventListener
crawlDataStoreFactory URLs and crawl-related information data store. MVStoreCrawlDataStoreFactory, JDBCCrawlDataStoreFactory, MongoCrawlDataStoreFactory ICrawlDataStoreFactory
httpClientFactory HTTP Client creation and initialization. GenericHttpClientFactory IHttpClientFactory
referenceFilters Filter based on references (i.e. URLs). ExtensionReferenceFilter, RegexReferenceFilter, SegmentCountURLFilter IReferenceFilter
robotsTxt Handle robots.txt files. StandardRobotsTxtProvider IRobotsTxtProvider
sitemapResolverFactory Handle Sitemap files. StandardSitemapResolverFactory ISitemapResolverFactory
redirectURLProvider Provides the target URL to use when a redirect is encountered. GenericRedirectURLProvider IRedirectURLProvider
recrawlableResolver Indicates if a target URL is ready for recrawl or not. GenericRecrawlableResolver IRecrawlableResolver
metadataFetcher Fetches HTTP headers for a URL. GenericMetadataFetcher IHttpMetadataFetcher
metadataFilters Filter based on HTTP Headers. ExtensionReferenceFilter, RegexReferenceFilter, RegexMetadataFilter, SegmentCountURLFilter IMetadataFilter
canonicalLinkDetector Detects pages with a canonical link and rejects them in favor of the canonical one. GenericCanonicalLinkDetector ICanonicalLinkDetector
metadataChecksummer Create document checksum from HTTP Headers. LastModifiedMetadataChecksummer, GenericMetadataChecksummer IMetadataChecksummer
documentFetcher Fetch a document from URL. GenericDocumentFetcher, PhantomJSDocumentFetcher IHttpDocumentFetcher
robotsMeta Handle in-page robots instructions. StandardRobotsMetaProvider IRobotsMetaProvider
linkExtractors One or more <extractor> for extracting URLs and other link data from a document. GenericLinkExtractor, TikaLinkExtractor, RegexLinkExtractor, XMLFeedLinkExtractor ILinkExtractor
documentFilters Filter documents (after links were extracted). ExtensionReferenceFilter, RegexReferenceFilter, RegexMetadataFilter, SegmentCountURLFilter IDocumentFilter
preImportProcessors Process a document before import. FeaturedImageProcessor IHttpDocumentProcessor
importer Performs document text extraction and manipulation. It has many features and many file formats are supported. Refer to Importer configuration options.
documentChecksummer Create a checksum from document. MD5DocumentChecksummer IDocumentChecksummer
postImportProcessors Process a document after import. IHttpDocumentProcessor
spoiledReferenceStrategizer Establish the strategy to adopt for references that have turned bad. GenericSpoiledReferenceStrategizer ISpoiledReferenceStrategizer
committer Where to commit a document when processed. Different implementations are available. Check out the list of available Committers.
crawler Define as many crawlers as you like. They must each have a unique identifier. N/A N/A
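
As an illustration of how fully qualified class names appear once filled in, the following hedged sketch configures a crawl delay and a reference filter. The package names follow the 2.x API and the regular expression is a made-up example; confirm both against the Javadoc for your version.

<!-- Wait roughly 3 seconds between page downloads. -->
<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
       default="3000" />

<!-- Only crawl references matching this (hypothetical) regular expression. -->
<referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include">https://www\.example\.com/docs/.*</filter>
</referenceFilters>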

Importer Configuration Options

The Importer module is an integral part of the Web Crawler. It is responsible for extracting text out of documents and also provides document manipulation and filtering options. Much more can be found in this module, which is distributed with the Web Crawler. Read the Importer Configuration Options.
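
As a hedged example of what can go inside the <importer> block, the following sketch keeps only a few metadata fields after parsing. The KeepOnlyTagger handler and the exact XML form shown are assumptions based on the 2.x Importer API; refer to the Importer documentation for the authoritative syntax.

<importer>
    <postParseHandlers>
        <!-- Discard every metadata field except the ones listed. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description</fields>
        </tagger>
    </postParseHandlers>
</importer>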

Committer Configuration Options

The Committer module is responsible for taking the text extracted from your collected documents and submitting it to your target repository (e.g., a search engine). Make sure you download a Committer implementation matching your target repository. Configuration options are specific to each Committer; refer to your Committer's documentation.
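
For example, with the Solr Committer installed, the <committer> block might look roughly like the sketch below. The class name, <solrURL> value, and <queueSize> setting are assumptions to verify against that Committer's documentation.

<committer class="com.norconex.committer.solr.SolrCommitter">
    <!-- Hypothetical Solr core; point this to your own repository. -->
    <solrURL>http://localhost:8983/solr/mycore</solrURL>
    <!-- Number of documents to queue before sending them as one batch. -->
    <queueSize>100</queueSize>
</committer>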

More Options

There is a lot more you can do to structure your configuration files the way you like. Refer to this additional documentation for more configuration options such as creating reusable configuration fragments and using variables to make your files easier to maintain and more portable across different environments.
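
For instance, because configuration files are processed as templates, you can typically reference variables and pull in shared fragments along these lines. The file names and the ${...} / #parse(...) syntax below are assumptions to verify against that documentation.

<!-- my-config.xml (hypothetical) -->
<httpcollector id="${collectorId}">
    <!-- Include a fragment shared by several configurations (hypothetical file). -->
    #parse("shared-crawler-defaults.xml")
    ...
</httpcollector>

A matching variables file (e.g., my-config.variables) would then define values such as collectorId = My Collector.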