To get started quickly, download the latest version of Norconex Web Crawler and locate the file ./examples/HOWTO_RUN_EXAMPLES.txt. This file points you to functional configuration files and will have you running a very simple crawler in no time. See how to run these examples here; they are replicated for your convenience.
To get the full potential of Norconex Web Crawler and learn which parts can easily be extended, refer to the following XML-based configuration. Entries with a "class" attribute expect an implementation of your choice. The Web Crawler API already offers several concrete implementations, and configuration options with a default value do not have to be defined. Developers can also create their own implementations of the proper Java interfaces. Refer to the Norconex Web Crawler JavaDoc and/or see further down which interfaces you can implement. Go to the Extend the Web Crawler page for more details on adding your own implementations.
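For instance, a tag that takes a "class" attribute is configured with the fully qualified name of the chosen implementation. A minimal sketch, assuming the stock 2.x package layout:

```xml
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer"/>
```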
Click on a tag name to jump to/from its documentation.
```xml
<httpcollector id="...">

  <progressDir>...</progressDir>
  <logsDir>...</logsDir>

  <collectorListeners>
    <listener class="..."/>
    ...
  </collectorListeners>

  <crawlerDefaults>
    <!-- All crawler options defined below (except for the crawler "id")
         can be set here as defaults shared between multiple crawlers.
         Configuration blocks defined for a specific crawler always take
         precedence. -->
  </crawlerDefaults>

  <crawlers>
    <!-- You need to define at least one crawler. -->
    <crawler id="...">

      <startURLs stayOnDomain="..." includeSubdomains="..."
                 stayOnPort="..." stayOnProtocol="...">
        <url>...</url>
        <urlsFile>...</urlsFile>
        <sitemap>...</sitemap>
        <provider class="..."/>
        ...
      </startURLs>

      <userAgent>...</userAgent>
      <urlNormalizer class="..." />
      <delay class="..." />
      <numThreads>...</numThreads>
      <maxDepth>...</maxDepth>
      <maxDocuments>...</maxDocuments>
      <workDir>...</workDir>
      <keepDownloads>...</keepDownloads>
      <keepOutOfScopeLinks>...</keepOutOfScopeLinks>
      <orphansStrategy>...</orphansStrategy>

      <stopOnExceptions>
        <exception>...</exception>
        ...
      </stopOnExceptions>

      <crawlerListeners>
        <listener class="..."/>
        ...
      </crawlerListeners>

      <crawlDataStoreFactory class="..." />
      <httpClientFactory class="..." />

      <referenceFilters>
        <filter class="..." />
        ...
      </referenceFilters>

      <robotsTxt ignore="..." class="..."/>
      <sitemapResolverFactory ignore="..." class="...">
        ...
      </sitemapResolverFactory>
      <redirectURLProvider class="..." />
      <recrawlableResolver class="..." />
      <metadataFetcher class="..." />

      <metadataFilters>
        <filter class="..." />
        ...
      </metadataFilters>

      <canonicalLinkDetector ignore="..." class="..." />
      <metadataChecksummer class="..." />
      <documentFetcher class="..." />
      <robotsMeta ignore="..." class="..."/>

      <linkExtractors>
        <extractor class="..." />
      </linkExtractors>

      <documentFilters>
        <filter class="..." />
        ...
      </documentFilters>

      <preImportProcessors>
        <processor class="..." />
        ...
      </preImportProcessors>

      <importer>
        <!-- refer to Importer documentation -->
      </importer>

      <documentChecksummer class="..." />

      <postImportProcessors>
        <processor class="..."></processor>
      </postImportProcessors>

      <spoiledReferenceStrategizer class="..." />

      <committer class="...">
        <!-- refer to Committer documentation -->
      </committer>

    </crawler>
    ...
  </crawlers>

</httpcollector>
```
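To make the reference above more concrete, here is a minimal configuration sketch using only a handful of options. The identifiers, start URL, and output directory are hypothetical placeholders, and every omitted option falls back to its default:

```xml
<httpcollector id="Minimal Collector">
  <crawlers>
    <crawler id="Minimal Crawler">
      <!-- Hypothetical site; stayOnDomain keeps the crawl in scope. -->
      <startURLs stayOnDomain="true">
        <url>https://example.com/</url>
      </startURLs>
      <maxDepth>2</maxDepth>
      <!-- Store processed documents on the local file system. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./crawled-output</directory>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
```

Saved as, say, minimal-config.xml, such a file can be passed to the launch scripts described in HOWTO_RUN_EXAMPLES.txt.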
The table below lists the interface names that you can easily extend, along with the available out-of-the-box implementations. In the configuration file, you have to use the fully qualified name, as defined in the Javadoc. Click on a class or interface name to go directly to its full documentation, with extra configuration options. When a default implementation exists for a configuration option taking a class attribute, it is highlighted.
Tag | Description | Classes | Interface |
---|---|---|---|
httpcollector | Root tag; you must give your configuration a unique identifier value. | N/A | N/A |
progressDir | Directory where crawling progress files are stored. Default is "./progress". | N/A | N/A |
logsDir | Directory where crawl logs are stored. Default is "./logs". | N/A | N/A |
collectorListeners | Listen to collector events. | | ICollectorLifeCycleListener |
startURLs | URLs to start crawling from. Can be one or several of <url>, <urlsFile> (file containing URLs), <sitemap> (URL to a sitemap.xml file) or <provider> (for a dynamically generated list). | N/A | IStartURLsProvider (<provider> only) |
userAgent | The crawler "User-Agent" value to identify your crawler to sites you crawl. | N/A | N/A |
urlNormalizer | Normalizes incoming URLs. | GenericURLNormalizer | IURLNormalizer |
delay | Handles interval between each page download. | GenericDelayResolver, ReferenceDelayResolver | IDelayResolver |
numThreads | Number of execution threads for a crawler. Default is 2. | N/A | N/A |
maxDepth | How many levels deep to crawl from the start URL(s). Default is -1 (unlimited). | N/A | N/A |
maxDocuments | Maximum documents to successfully process. Default is -1 (unlimited). | N/A | N/A |
workDir | Where to store files created as part of crawling activities. Default is "./work". | N/A | N/A |
keepDownloads | Whether to keep downloaded files. Default is false. | N/A | N/A |
keepOutOfScopeLinks | Keep extracted links that are out-of-scope according to the start URL stayOn... flags. Default is false. | N/A | N/A |
orphansStrategy | What to do with URLs that are no longer referenced. PROCESS (default), IGNORE, or DELETE. | N/A | N/A |
stopOnExceptions | What exception(s) should force a crawler to stop when triggered during the processing of a document. | N/A | N/A |
crawlerListeners | Listen to crawling events. | URLStatusCrawlerEventListener | ICrawlerEventListener |
crawlDataStoreFactory | URLs and crawl-related information data store. | MVStoreCrawlDataStoreFactory, JDBCCrawlDataStoreFactory, MongoCrawlDataStoreFactory | ICrawlDataStoreFactory |
httpClientFactory | HTTP Client creation and initialization. | GenericHttpClientFactory | IHttpClientFactory |
referenceFilters | Filter based on references (i.e. URLs); see the example after this table. | ExtensionReferenceFilter, RegexReferenceFilter, SegmentCountURLFilter | IReferenceFilter |
robotsTxt | Handle robots.txt files. | StandardRobotsTxtProvider | IRobotsTxtProvider |
sitemapResolverFactory | Handle Sitemap files. | StandardSitemapResolverFactory | ISitemapResolverFactory |
redirectURLProvider | Provides the target URL to use when a redirect is encountered. | GenericRedirectURLProvider | IRedirectURLProvider |
recrawlableResolver | Indicates if a target URL is ready for recrawl or not. | GenericRecrawlableResolver | IRecrawlableResolver |
metadataFetcher | Fetches HTTP headers for a URL. | GenericMetadataFetcher | IHttpMetadataFetcher |
metadataFilters | Filter based on HTTP Headers. | ExtensionReferenceFilter, RegexReferenceFilter, RegexMetadataFilter, SegmentCountURLFilter | IMetadataFilter |
canonicalLinkDetector | Detects pages with a canonical link and rejects them in favor of the canonical one. | GenericCanonicalLinkDetector | ICanonicalLinkDetector |
metadataChecksummer | Create document checksum from HTTP Headers. | LastModifiedMetadataChecksummer, GenericMetadataChecksummer | IMetadataChecksummer |
documentFetcher | Fetch a document from a URL. | GenericDocumentFetcher, PhantomJSDocumentFetcher | IHttpDocumentFetcher |
robotsMeta | Handle in-page robots instructions. | StandardRobotsMetaProvider | IRobotsMetaProvider |
linkExtractors | One or more <extractor> for extracting URLs and other link data from a document. | GenericLinkExtractor, TikaLinkExtractor, RegexLinkExtractor, XMLFeedLinkExtractor | ILinkExtractor |
documentFilters | Filter documents (after links were extracted). | ExtensionReferenceFilter, RegexReferenceFilter, RegexMetadataFilter, SegmentCountURLFilter | IDocumentFilter |
preImportProcessors | Process a document before import. | FeaturedImageProcessor | IHttpDocumentProcessor |
importer | Performs document text extraction and manipulation. It has many features and supports many file formats. Refer to the Importer configuration options. | N/A | N/A |
documentChecksummer | Create a checksum from document. | MD5DocumentChecksummer | IDocumentChecksummer |
postImportProcessors | Process a document after import. | | IHttpDocumentProcessor |
spoiledReferenceStrategizer | Establish the strategy to adopt for references that have turned bad. | GenericSpoiledReferenceStrategizer | ISpoiledReferenceStrategizer |
committer | Where to commit a document when processed. Different implementations are available; check out the list of available Committers. | | ICommitter |
crawler | Define as many crawlers as you like. They must each have a unique identifier. | N/A | N/A |
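As an example of how several of these options compose, the sketch below narrows a crawl using reference filters. The URLs and regular expression are illustrative only, and the class names assume the 2.x collector-core packages:

```xml
<referenceFilters>
  <!-- Keep only URLs under a (hypothetical) /docs/ section. -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="include">https://example\.com/docs/.*</filter>
  <!-- Reject common binary formats by file extension. -->
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
          onMatch="exclude">jpg,png,gif,pdf</filter>
</referenceFilters>
```

When one or more filters are declared with onMatch="include", a reference must match at least one of them to be kept; any matching "exclude" filter rejects it.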