Norconex File System Crawler

Configuration

Configuration Quick Start

To get started quickly, download the latest version of Norconex File System Crawler and locate the file ./examples/HOWTO_RUN_EXAMPLES.txt. This file will point you to a functional configuration file and will have you running a very simple crawler in no time. The sample configuration files it references are found under the same ./examples/ directory.

File System Crawler Configuration Options

To get the full potential of Norconex File System Crawler and learn which parts can easily be extended, refer to the following XML-based configuration. Entries with a "class" attribute expect an implementation of your choice. The File System Crawler API already offers several concrete implementations. Developers can also create their own by implementing the proper Java interfaces. Refer to the Norconex File System Crawler Javadoc and/or the table further down to see which interfaces you can implement. Go to the Extend the File System Crawler page for more details on adding your own implementations.


<fscollector id="...">
 
    <progressDir>...</progressDir>
    <logsDir>...</logsDir>
    <collectorListeners>
        <listener class="..."/>
        ...
    </collectorListeners>
 
    <crawlerDefaults>
        <!-- All crawler options defined below (except for the crawler "id")
             can be set here as defaults shared between multiple crawlers.
             Configuration blocks defined for a specific crawler always take
             precedence. -->
    </crawlerDefaults>
 
    <crawlers>
        <crawler id="...">
            <startPaths>
                <path>...</path>
                <pathsFile>...</pathsFile>
                <provider class="..."/>
                ...
            </startPaths>
            <workDir>...</workDir>
            <numThreads>...</numThreads>
            <maxDocuments>...</maxDocuments>
            <keepDownloads>...</keepDownloads>
            <orphansStrategy>...</orphansStrategy>
            <stopOnExceptions>
              <exception>...</exception>
              ...
            </stopOnExceptions>
            <crawlerListeners>
                <listener class="..."/>
                ...
            </crawlerListeners>
            <crawlDataStoreFactory class="..." />
            <optionsProvider class="..." />
            <referenceFilters>
                <filter class="..." />
                ...
            </referenceFilters>
            <metadataFetcher class="..." />
            <metadataFilters>
                <filter class="..." />
                ...
            </metadataFilters>
            <metadataChecksummer class="..." />
            <documentFetcher class="..." />
            <documentFilters>
                <filter class="..." />
                ...
            </documentFilters>
            <preImportProcessors>
                <processor class="..." />
                ...
            </preImportProcessors>
 
            <importer>
                <!-- refer to Importer documentation -->
            </importer>
 
            <documentChecksummer class="..." />
 
            <postImportProcessors>
                <processor class="..." />
                ...
            </postImportProcessors>
 
            <spoiledReferenceStrategizer class="..." />
 
            <committer class="..." />
        </crawler>
        ...
    </crawlers>
 
</fscollector>
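
To make the structure above concrete, here is a minimal configuration sketch that crawls a single directory and writes extracted content to local files. The paths are placeholders, and the committer class name should be verified against the Committer Core Javadoc for your version:

<fscollector id="minimal-fs-example">

    <crawlers>
        <crawler id="sample-crawler">
            <!-- Replace with the directory you want to crawl. -->
            <startPaths>
                <path>/path/to/files/to/crawl</path>
            </startPaths>
            <workDir>./work</workDir>
            <numThreads>2</numThreads>
            <!-- FileSystemCommitter writes committed documents to disk;
                 handy for testing before targeting a real repository. -->
            <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
                <directory>./committed</directory>
            </committer>
        </crawler>
    </crawlers>

</fscollector>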

The table below lists the interface names you can easily extend, along with the available out-of-the-box implementations. In the configuration file, you must use fully qualified class names, as defined in the Javadoc (you can use variables to shorten package names). Refer to each class or interface's Javadoc for its full documentation and extra configuration options.

Tag | Description | Classes | Interface
fscollector | Root tag; you must give your configuration a unique identifier value. | N/A | N/A
progressDir | Directory where to store crawling progress files. Default is "./progress". | N/A | N/A
logsDir | Directory where crawl logs will be stored. Default is "./logs". | N/A | N/A
collectorListeners | Listen to collector events. | | ICollectorLifeCycleListener
startPaths | Paths to start crawling from. Can be one or several of <path>, <pathsFile> (a file containing paths), or <provider> (for a dynamically generated list). | N/A | IStartPathsProvider (<provider> only)
numThreads | Number of execution threads for a crawler. Default is 2. | N/A | N/A
maxDocuments | Maximum files to successfully process. Default is -1 (unlimited). | N/A | N/A
workDir | Where to store files created as part of crawling activities. Default is "./work". | N/A | N/A
keepDownloads | Whether to keep downloaded files. Default is false. | N/A | N/A
orphansStrategy | What to do with references that are no longer encountered. PROCESS (default), IGNORE, or DELETE. | N/A | N/A
stopOnExceptions | What exception(s) should force a crawler to stop when triggered during the processing of a document. | N/A | N/A
crawlerListeners | Listen to crawling events. | | ICrawlerEventListener
crawlDataStoreFactory | References and crawl-related information data store. | MVStoreCrawlDataStoreFactory, BasicJDBCCrawlDataStoreFactory, MongoCrawlDataStoreFactory | ICrawlDataStoreFactory
optionsProvider | Provider of file system options. | GenericFilesystemOptionsProvider | IFilesystemOptionsProvider
referenceFilters | Filter based on references (i.e., file paths). | ExtensionReferenceFilter, RegexReferenceFilter | IReferenceFilter
metadataFetcher | Fetch a file's metadata. | GenericFileMetadataFetcher | IFileMetadataFetcher
metadataFilters | Filter based on file properties. | ExtensionReferenceFilter, RegexReferenceFilter, RegexMetadataFilter | IMetadataFilter
metadataChecksummer | Create a document checksum from file properties. | FileMetadataChecksummer, GenericMetadataChecksummer | IMetadataChecksummer
documentFetcher | Fetch a document. | GenericFileDocumentFetcher | IFileDocumentFetcher
documentFilters | Filter documents. | ExtensionReferenceFilter, RegexReferenceFilter, RegexMetadataFilter | IDocumentFilter
preImportProcessors | Process a document before import. | | IFileDocumentProcessor
importer | Performs document text extraction and manipulation. It has many features and supports many file formats. Refer to the Importer configuration options. | N/A | N/A
documentChecksummer | Create a checksum from a document. | MD5DocumentChecksummer | IDocumentChecksummer
postImportProcessors | Process a document after import. | | IFileDocumentProcessor
spoiledReferenceStrategizer | Establish the strategy to adopt for references that have turned bad. | GenericSpoiledReferenceStrategizer | ISpoiledReferenceStrategizer
committer | Where to commit a document when processed. Different implementations are available; check the list of available Committers. | N/A | N/A
crawler | Define as many crawlers as you like. They must each have a unique identifier. | N/A | N/A
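
As an illustration of the fully qualified name requirement, a referenceFilters block combining the two out-of-the-box filters might look like the sketch below. The package names follow collector-core conventions and should be double-checked against the Javadoc for your version:

<referenceFilters>
    <!-- Keep only PDF and Word files. -->
    <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
            onMatch="include">pdf,doc,docx</filter>
    <!-- Skip anything under a "tmp" directory. -->
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="exclude">.*/tmp/.*</filter>
</referenceFilters>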

Importer Configuration Options

The Importer module is an integral part of the File System Crawler. It is responsible for extracting text out of documents. It also provides document manipulation and filtering options. This module, distributed with the File System Crawler, offers much more. Read the Importer Configuration Options.
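
As a quick taste of what an importer block can look like, the sketch below tags every document with a constant field after parsing. The ConstantTagger class name comes from the Importer module and should be verified against the Importer Javadoc:

<importer>
    <postParseHandlers>
        <!-- Adds a constant "source" field to every crawled document. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
            <constant name="source">filesystem</constant>
        </tagger>
    </postParseHandlers>
</importer>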

Committer Configuration Options

The Committer module is responsible for taking the text extracted out of your collected documents and submitting it to your target repository (e.g., a search engine). Make sure you download a Committer implementation matching your target repository. Configuration options are specific to each Committer. Refer to your Committer's documentation.
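
For instance, assuming your target repository is Apache Solr and you have downloaded the Solr Committer, the committer block might look like this sketch (tag and class names should be verified against that Committer's own documentation):

<committer class="com.norconex.committer.solr.SolrCommitter">
    <solrURL>http://localhost:8983/solr/mycore</solrURL>
</committer>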

More Options

There is a lot more you can do to structure your configuration files the way you like. Refer to this additional documentation for more configuration options such as creating reusable configuration fragments and using variables to make your files easier to maintain and more portable across different environments.
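
As a brief illustration of variables, assuming the standard convention where a file with the same name as your configuration but a .variables extension is loaded automatically (file names here are hypothetical), you could externalize environment-specific values like this:

# myconfig.variables
workdir=/opt/crawler/work
startPath=/data/documents

<!-- In myconfig.xml, the values are substituted at load time: -->
<startPaths>
    <path>${startPath}</path>
</startPaths>
<workDir>${workdir}</workDir>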