To get started quickly, download the latest version of Norconex File System Crawler and locate the file ./examples/HOWTO_RUN_EXAMPLES.txt. This file will point you to a functional configuration file and will have you running a very simple crawler in no time. These sample files are also made available for download for your convenience.

To get the full potential of Norconex File System Crawler and learn which parts can easily be extended, refer to the following XML-based configuration reference. Entries with a "class" attribute expect an implementation of your choice. The File System Crawler API already offers several concrete implementations, and developers can create their own by implementing the proper Java interfaces. Refer to the Norconex File System Crawler JavaDoc and/or see further down which interfaces you can implement. Go to the Extend the File System Crawler page for more details on adding your own implementations.
Each tag in the configuration reference below is documented in the table that follows it.
```xml
<fscollector id="...">
  <progressDir>...</progressDir>
  <logsDir>...</logsDir>
  <collectorListeners>
    <listener class="..."/>
    ...
  </collectorListeners>
  <crawlerDefaults>
    <!-- All crawler options defined below (except for the crawler "id")
         can be set here as defaults shared between multiple crawlers.
         Configuration blocks defined for a specific crawler always take
         precedence. -->
  </crawlerDefaults>
  <crawlers>
    <crawler id="...">
      <startPaths>
        <path>...</path>
        <pathsFile>...</pathsFile>
        <provider class="..."/>
        ...
      </startPaths>
      <workDir>...</workDir>
      <numThreads>...</numThreads>
      <maxDocuments>...</maxDocuments>
      <keepDownloads>...</keepDownloads>
      <orphansStrategy>...</orphansStrategy>
      <stopOnExceptions>
        <exception>...</exception>
        ...
      </stopOnExceptions>
      <crawlerListeners>
        <listener class="..."/>
        ...
      </crawlerListeners>
      <crawlDataStoreFactory class="..."/>
      <optionsProvider class="..."/>
      <referenceFilters>
        <filter class="..."/>
        ...
      </referenceFilters>
      <metadataFetcher class="..."/>
      <metadataFilters>
        <filter class="..."/>
        ...
      </metadataFilters>
      <metadataChecksummer class="..."/>
      <documentFetcher class="..."/>
      <documentFilters>
        <filter class="..."/>
        ...
      </documentFilters>
      <preImportProcessors>
        <processor class="..."/>
        ...
      </preImportProcessors>
      <importer>
        <!-- refer to Importer documentation -->
      </importer>
      <documentChecksummer class="..."/>
      <postImportProcessors>
        <processor class="..."/>
      </postImportProcessors>
      <spoiledReferenceStrategizer class="..."/>
      <committer class="..."/>
    </crawler>
    ...
  </crawlers>
</fscollector>
```
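As a point of reference, here is a minimal sketch of what a working configuration could look like, assuming a local directory to crawl and the FileSystemCommitter from the Committer Core library to write extracted content to disk; the paths shown are placeholders:

```xml
<fscollector id="Minimal FS Collector">
  <crawlers>
    <crawler id="Sample Crawler">
      <startPaths>
        <!-- Placeholder: point this to the directory you want to crawl. -->
        <path>/path/to/files</path>
      </startPaths>
      <!-- Writes extracted content to flat files so you can inspect results. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./committed-files</directory>
      </committer>
    </crawler>
  </crawlers>
</fscollector>
```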
The table below lists interface names that you can easily extend, along with the available out-of-the-box implementations. Refer to the documentation of each class or interface for its full set of extra configuration options. When a default implementation exists for a configuration option taking a class attribute, it is highlighted. In the configuration file, you must use fully qualified class names, as defined in the Javadoc; you can use variables to shorten package names, as illustrated below.
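For instance, the following is a minimal sketch assuming a variables file supplied alongside the configuration (e.g., through the launch script's variables option); the file name, variable name, and extension list are arbitrary examples:

```properties
# sample.variables (hypothetical file name)
filterClass = com.norconex.collector.core.filter.impl.ExtensionReferenceFilter
```

```xml
<!-- The variable keeps the long package name out of the XML. -->
<referenceFilters>
  <filter class="${filterClass}" onMatch="include">pdf,doc,docx,txt</filter>
</referenceFilters>
```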
Tag | Description | Classes | Interface |
---|---|---|---|
fscollector | Root tag; you must give your configuration a unique identifier value. | N/A | N/A |
progressDir | Directory where crawling progress files are stored. Default is "./progress". | N/A | N/A |
logsDir | Directory where crawl logs will be stored. Default is "./logs". | N/A | N/A |
collectorListeners | Listen to collector events. | | ICollectorLifeCycleListener |
startPaths | Paths to start crawling from. Can be one or several of <path>, <pathsFile> (a file containing start paths) or <provider> (for a dynamically generated list). | N/A | IStartPathsProvider (<provider> only) |
numThreads | Number of execution threads for a crawler. Default is 2. | N/A | N/A |
maxDocuments | Maximum files to successfully process. Default is -1 (unlimited). | N/A | N/A |
workDir | Where to store files created as part of crawling activities. Default is "./work". | N/A | N/A |
keepDownloads | Whether to keep downloaded files. Default is false. | N/A | N/A |
orphansStrategy | What to do with paths that are no longer being referenced. PROCESS (default), IGNORE, or DELETE. | N/A | N/A |
stopOnExceptions | What exception(s) should force a crawler to stop when triggered during the processing of a document. | N/A | N/A |
crawlerListeners | Listen to crawling events. | | ICrawlerEventListener |
crawlDataStoreFactory | Data store for references and crawl-related information. | MVStoreCrawlDataStoreFactory, BasicJDBCCrawlDataStoreFactory, MongoCrawlDataStoreFactory | ICrawlDataStoreFactory |
optionsProvider | Provider of file system options. | GenericFilesystemOptionsProvider | IFilesystemOptionsProvider |
referenceFilters | Filter based on references (i.e., paths). | ExtensionReferenceFilter, RegexReferenceFilter | IReferenceFilter |
metadataFetcher | Fetch a file's metadata. | GenericFileMetadataFetcher | IFileMetadataFetcher |
metadataFilters | Filter based on file properties. | ExtensionReferenceFilter, RegexReferenceFilter, RegexMetadataFilter | IMetadataFilter |
metadataChecksummer | Create document checksum from file properties. | FileMetadataChecksummer, GenericMetadataChecksummer | IMetadataChecksummer |
documentFetcher | Fetch a document. | GenericFileDocumentFetcher | IFileDocumentFetcher |
documentFilters | Filter documents. | ExtensionReferenceFilter, RegexReferenceFilter, RegexMetadataFilter | IDocumentFilter |
preImportProcessors | Process a document before import. | | IFileDocumentProcessor |
importer | Performs document text extraction and manipulation. It has many features and supports many file formats. Refer to the Importer configuration options. | N/A | N/A |
documentChecksummer | Create a checksum from document. | MD5DocumentChecksummer | IDocumentChecksummer |
postImportProcessors | Process a document after import. | | IFileDocumentProcessor |
spoiledReferenceStrategizer | Establish the strategy to adopt for references that have turned bad. | GenericSpoiledReferenceStrategizer | ISpoiledReferenceStrategizer |
committer | Where to commit a document when processed. Different implementations are available. Check the list of available Committers. | ||
crawler | Define as many crawlers as you like; each must have a unique identifier (see the defaults/override sketch after this table). | N/A | N/A |
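To illustrate how crawlerDefaults and per-crawler settings interact, here is a small sketch in which two crawlers share default options and one overrides a default; the paths, identifiers, and values are arbitrary examples:

```xml
<crawlerDefaults>
  <numThreads>4</numThreads>
  <referenceFilters>
    <!-- Skip temporary and backup files for every crawler. -->
    <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
            onMatch="exclude">tmp,bak</filter>
  </referenceFilters>
</crawlerDefaults>
<crawlers>
  <crawler id="Documents Share">
    <startPaths><path>/mnt/docs</path></startPaths>
  </crawler>
  <crawler id="Archives Share">
    <startPaths><path>/mnt/archives</path></startPaths>
    <!-- Overrides the shared default of 4 threads for this crawler only. -->
    <numThreads>2</numThreads>
  </crawler>
</crawlers>
```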
The Importer module is an integral part of the File System Crawler. It is responsible for extracting text out of documents. It also provides document manipulation and filtering options. Much more can be found in this module, which is distributed with the File System Crawler. Read the Importer Configuration Options.
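As a small illustration only (not a substitute for the Importer documentation), the following sketch adds a constant field to every document after parsing, using the Importer module's ConstantTagger; the field name and value are arbitrary examples:

```xml
<importer>
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
      <!-- Adds a "source" field with the value "file-share" to each document. -->
      <constant name="source">file-share</constant>
    </tagger>
  </postParseHandlers>
</importer>
```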
The Committer module is responsible for taking the text extracted out of your collected documents and submitting it to your target repository (e.g., a search engine). Make sure you download a Committer implementation matching your target repository. Configuration options are specific to each Committer; refer to your Committer's documentation.
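For example, assuming you downloaded the Solr Committer, a committer configuration could look like the following sketch; the class name comes from that separate download and the URL is a placeholder, so verify both against your Committer's documentation:

```xml
<committer class="com.norconex.committer.solr.SolrCommitter">
  <!-- Placeholder URL: point this to your Solr core. -->
  <solrURL>http://localhost:8983/solr/mycore</solrURL>
</committer>
```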