Norconex File System Crawler

Configuration

Configuration Quick Start

To get started quickly, download the latest version of Norconex File System Crawler and locate the file ./examples/HOWTO_RUN_EXAMPLES.txt. This file will point you to a functional configuration file and will have you running a very simple crawler in no time. The sample configuration files it references are found under the same ./examples/ directory.

File System Crawler Configuration Options

To get the full potential of Norconex File System Crawler and learn which parts can easily be extended, refer to the following XML-based configuration. Entries with a "class" attribute expect an implementation of your choice. The File System Crawler API already offers several concrete implementations. Developers can also create their own by implementing the proper Java interfaces. Refer to the Norconex File System Crawler Javadoc and/or the table further down to see which interfaces you can implement. Go to the Extend the File System Crawler page for more details on adding your own implementations.


<fscollector id="...">
 
    <progressDir>...</progressDir>
    <logsDir>...</logsDir>
    <collectorListeners>
        <listener class="..."/>
        ...
    </collectorListeners>
 
    <crawlerDefaults>
        <!-- All crawler options defined below (except for the crawler "id")
             can be set here as defaults shared between multiple crawlers.
             Configuration blocks defined for a specific crawler always take
             precedence. -->
    </crawlerDefaults>
 
    <crawlers>
        <crawler id="...">
            <startPaths>
                <path>...</path>
                <pathsFile>...</pathsFile>
                <provider class="..."/>
                ...
            </startPaths>
            <workDir>...</workDir>
            <numThreads>...</numThreads>
            <maxDocuments>...</maxDocuments>
            <keepDownloads>...</keepDownloads>
            <orphansStrategy>...</orphansStrategy>
            <stopOnExceptions>
              <exception>...</exception>
              ...
            </stopOnExceptions>
            <crawlerListeners>
                <listener class="..."/>
                ...
            </crawlerListeners>
            <crawlDataStoreFactory class="..." />
            <optionsProvider class="..." />
            <referenceFilters>
                <filter class="..." />
                ...
            </referenceFilters>
            <metadataFetcher class="..." />
            <metadataFilters>
                <filter class="..." />
                ...
            </metadataFilters>
            <metadataChecksummer class="..." />
            <documentFetcher class="..." />
            <documentFilters>
                <filter class="..." />
                ...
            </documentFilters>
            <preImportProcessors>
                <processor class="..." />
                ...
            </preImportProcessors>
 
            <importer>
                <!-- refer to Importer documentation -->
            </importer>
 
            <documentChecksummer class="..." />
 
            <postImportProcessors>
                <processor class="..." />
                ...
            </postImportProcessors>
 
            <spoiledReferenceStrategizer class="..." />
 
            <committer class="..." />
        </crawler>
        ...
    </crawlers>
 
</fscollector>
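
To make the structure above concrete, here is a minimal configuration sketch that crawls a single directory and writes extracted content to local files. The paths are placeholders, and the committer class name should be verified against the Committer Core Javadoc for your version:

<fscollector id="minimal-fs-example">

    <crawlers>
        <crawler id="sample-crawler">
            <!-- Replace with the directory you want to crawl. -->
            <startPaths>
                <path>/path/to/files/to/crawl</path>
            </startPaths>
            <workDir>./work</workDir>
            <numThreads>2</numThreads>
            <!-- FileSystemCommitter writes committed documents to disk;
                 handy for testing before targeting a real repository. -->
            <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
                <directory>./committed</directory>
            </committer>
        </crawler>
    </crawlers>

</fscollector>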

The table below lists the interface names you can easily extend, along with the available out-of-the-box implementations. In the configuration file, you must use fully qualified class names, as defined in the Javadoc (you can use variables to shorten package names). Refer to each class or interface's Javadoc for its full documentation and extra configuration options.

Tag | Description | Classes | Interface
fscollector | Root tag; you must give your configuration a unique identifier value. | N/A | N/A
progressDir | Directory where to store crawling progress files. Default is "./progress". | N/A | N/A
logsDir | Directory where crawl logs will be stored. Default is "./logs". | N/A | N/A
collectorListeners | Listen to collector events. | | ICollectorLifeCycleListener
startPaths | Paths to start crawling from. Can be one or several of <path>, <pathsFile> (a file containing paths), or <provider> (for a dynamically generated list). | N/A | IStartPathsProvider (<provider> only)
numThreads | Number of execution threads for a crawler. Default is 2. | N/A | N/A
maxDocuments | Maximum files to successfully process. Default is -1 (unlimited). | N/A | N/A
workDir | Where to store files created as part of crawling activities. Default is "./work". | N/A | N/A
keepDownloads | Whether to keep downloaded files. Default is false. | N/A | N/A
orphansStrategy | What to do with references that are no longer encountered. PROCESS (default), IGNORE, or DELETE. | N/A | N/A
stopOnExceptions | What exception(s) should force a crawler to stop when triggered during the processing of a document. | N/A | N/A
crawlerListeners | Listen to crawling events. | | ICrawlerEventListener
crawlDataStoreFactory | References and crawl-related information data store. | MVStoreCrawlDataStoreFactory, BasicJDBCCrawlDataStoreFactory, MongoCrawlDataStoreFactory | ICrawlDataStoreFactory
optionsProvider | Provider of file system options. | GenericFilesystemOptionsProvider | IFilesystemOptionsProvider
referenceFilters | Filter based on references (i.e., file paths). | ExtensionReferenceFilter, RegexReferenceFilter | IReferenceFilter
metadataFetcher | Fetch a file's metadata. | GenericFileMetadataFetcher | IFileMetadataFetcher
metadataFilters | Filter based on file properties. | ExtensionReferenceFilter, RegexReferenceFilter, RegexMetadataFilter | IMetadataFilter
metadataChecksummer | Create a document checksum from file properties. | FileMetadataChecksummer, GenericMetadataChecksummer | IMetadataChecksummer
documentFetcher | Fetch a document. | GenericFileDocumentFetcher | IFileDocumentFetcher
documentFilters | Filter documents. | ExtensionReferenceFilter, RegexReferenceFilter, RegexMetadataFilter | IDocumentFilter
preImportProcessors | Process a document before import. | | IFileDocumentProcessor
importer | Performs document text extraction and manipulation. It has many features and supports many file formats. Refer to the Importer configuration options. | N/A | N/A
documentChecksummer | Create a checksum from a document. | MD5DocumentChecksummer | IDocumentChecksummer
postImportProcessors | Process a document after import. | | IFileDocumentProcessor
spoiledReferenceStrategizer | Establish the strategy to adopt for references that have turned bad. | GenericSpoiledReferenceStrategizer | ISpoiledReferenceStrategizer
committer | Where to commit a document when processed. Different implementations are available; check the list of available Committers. | N/A | N/A
crawler | Define as many crawlers as you like. They must each have a unique identifier. | N/A | N/A
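
As an illustration of the fully qualified name requirement, a referenceFilters block combining the two out-of-the-box filters might look like the sketch below. The package names follow collector-core conventions and should be double-checked against the Javadoc for your version:

<referenceFilters>
    <!-- Keep only PDF and Word files. -->
    <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
            onMatch="include">pdf,doc,docx</filter>
    <!-- Skip anything under a "tmp" directory. -->
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="exclude">.*/tmp/.*</filter>
</referenceFilters>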

Importer Configuration Options

The Importer module is an integral part of the File System Crawler. It is responsible for extracting text out of documents. It also provides document manipulation and filtering options. This module, distributed with the File System Crawler, offers much more. Read the Importer Configuration Options.
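
As a quick taste of what an importer block can look like, the sketch below tags every document with a constant field after parsing. The ConstantTagger class name comes from the Importer module and should be verified against the Importer Javadoc:

<importer>
    <postParseHandlers>
        <!-- Adds a constant "source" field to every crawled document. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
            <constant name="source">filesystem</constant>
        </tagger>
    </postParseHandlers>
</importer>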

Committer Configuration Options

The Committer module is responsible for taking the text extracted out of your collected documents and submitting it to your target repository (e.g., a search engine). Make sure you download a Committer implementation matching your target repository. Configuration options are specific to each Committer. Refer to your Committer's documentation.
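
For instance, assuming your target repository is Apache Solr and you have downloaded the Solr Committer, the committer block might look like this sketch (tag and class names should be verified against that Committer's own documentation):

<committer class="com.norconex.committer.solr.SolrCommitter">
    <solrURL>http://localhost:8983/solr/mycore</solrURL>
</committer>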

More Options

There is a lot more you can do to structure your configuration files the way you like. Refer to this additional documentation for more configuration options such as creating reusable configuration fragments and using variables to make your files easier to maintain and more portable across different environments.
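
As a brief illustration of variables, assuming the standard convention where a file with the same name as your configuration but a .variables extension is loaded automatically (file names here are hypothetical), you could externalize environment-specific values like this:

# myconfig.variables
workdir=/opt/crawler/work
startPath=/data/documents

<!-- In myconfig.xml, the values are substituted at load time: -->
<startPaths>
    <path>${startPath}</path>
</startPaths>
<workDir>${workdir}</workDir>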