public interface ICrawlerConfig extends IXMLConfigurable
Modifier and Type | Interface and Description |
---|---|
static class | ICrawlerConfig.OrphansStrategy |
Modifier and Type | Method and Description |
---|---|
ICommitter | getCommitter(): Gets the Committer module configuration. |
ICrawlDataStoreFactory | getCrawlDataStoreFactory(): Gets the crawl data store factory a crawler should use. |
ICrawlerEventListener[] | getCrawlerListeners(): Gets crawler event listeners. |
IDocumentChecksummer | getDocumentChecksummer(): Gets the document checksummer. |
IDocumentFilter[] | getDocumentFilters(): Gets the document filters. |
String | getId(): Gets this crawler's unique identifier. |
ImporterConfig | getImporterConfig(): Gets the Importer module configuration. |
int | getMaxDocuments(): Gets the maximum number of documents that can be processed. |
IMetadataFilter[] | getMetadataFilters(): Gets the metadata filters. |
int | getNumThreads(): Gets the maximum number of threads a crawler should use. |
ICrawlerConfig.OrphansStrategy | getOrphansStrategy(): Gets the strategy to adopt when there are orphans. |
IReferenceFilter[] | getReferenceFilters(): Gets the reference filters. |
ISpoiledReferenceStrategizer | getSpoiledReferenceStrategizer(): Gets the spoiled state strategy resolver. |
Class<? extends Exception>[] | getStopOnExceptions(): Gets the exceptions we want to stop the crawler on. |
File | getWorkDir(): Gets the crawler working directory where many files created at execution time are stored. |
Methods inherited from interface IXMLConfigurable: loadFromXML, saveToXML
String getId()
File getWorkDir()
int getNumThreads()
int getMaxDocuments()
Class<? extends Exception>[] getStopOnExceptions()
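As an illustration of how a caller might use the array returned by getStopOnExceptions(), the sketch below checks a thrown exception against a list of configured classes. StopOnExceptionsSketch, shouldStop and sampleConfig are hypothetical names, and matching subclasses with isInstance() is an assumption of this sketch; an implementing crawler may apply different matching rules.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class StopOnExceptionsSketch {

    // Hypothetical stand-in for the value returned by getStopOnExceptions().
    @SuppressWarnings("unchecked")
    public static Class<? extends Exception>[] sampleConfig() {
        return new Class[] { IOException.class };
    }

    // Returns true when the thrown exception is an instance of any configured
    // class. Subclass matching is an assumption, not documented behavior.
    public static boolean shouldStop(
            Exception thrown, Class<? extends Exception>[] stopOn) {
        if (thrown == null || stopOn == null) {
            return false;
        }
        for (Class<? extends Exception> type : stopOn) {
            if (type != null && type.isInstance(thrown)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Class<? extends Exception>[] stopOn = sampleConfig();
        // FileNotFoundException is a subclass of IOException.
        System.out.println(shouldStop(new FileNotFoundException("missing"), stopOn));
        // IllegalArgumentException is unrelated to IOException.
        System.out.println(shouldStop(new IllegalArgumentException("bad"), stopOn));
    }
}
```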
ICrawlerConfig.OrphansStrategy getOrphansStrategy()
Gets the strategy to adopt when there are orphans. Orphans are references that were processed in a previous run but were not encountered in the current run. In other words, they are leftovers from a previous run that were not re-encountered in the current one.
Since 1.2.0, unless explicitly stated otherwise by an implementing class, the default strategy is to PROCESS orphans.
Setting a null value is the same as setting IGNORE.
Be careful: Setting the orphan strategy to DELETE is NOT recommended in most cases. With some collectors, a temporary failure, such as a network outage or a web page timing out, may cause some documents not to be crawled. When this happens, the unreachable documents are considered "orphans" and are deleted, whereas under normal circumstances they should be kept. Re-processing them (the default) is usually the safest way to confirm they still exist before deleting or updating them.
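The orphan rules described above can be sketched in a few lines. This is a minimal illustration, not the collector's implementation: the nested enum is a local stand-in for ICrawlerConfig.OrphansStrategy (using the PROCESS, IGNORE and DELETE constants named in the text), and OrphanStrategySketch, effective, findOrphans and describe are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class OrphanStrategySketch {

    // Local stand-in for ICrawlerConfig.OrphansStrategy (assumption).
    public enum OrphansStrategy { PROCESS, IGNORE, DELETE }

    // A null strategy is treated the same as IGNORE, as documented.
    public static OrphansStrategy effective(OrphansStrategy configured) {
        return configured == null ? OrphansStrategy.IGNORE : configured;
    }

    // Orphans are references processed in the previous run that were not
    // encountered in the current run.
    public static List<String> findOrphans(
            Collection<String> previousRun, Collection<String> currentRun) {
        Set<String> orphans = new LinkedHashSet<>(previousRun);
        orphans.removeAll(currentRun);
        return new ArrayList<>(orphans);
    }

    // Describes what this sketch would do with orphans for a given strategy.
    public static String describe(OrphansStrategy configured) {
        switch (effective(configured)) {
            case PROCESS:
                return "re-process orphans to confirm they still exist";
            case DELETE:
                return "delete orphans";
            default:
                return "leave orphans untouched";
        }
    }

    public static void main(String[] args) {
        List<String> previous = Arrays.asList("a.html", "b.html", "c.html");
        List<String> current = Arrays.asList("a.html", "c.html");
        System.out.println("Orphans: " + findOrphans(previous, current));
        System.out.println("null strategy: " + describe(null));
        System.out.println("PROCESS strategy: " + describe(OrphansStrategy.PROCESS));
    }
}
```

Re-processing first, rather than deleting outright, reflects the caution above: a reference missing from one run may simply have been unreachable during that run.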
ICrawlDataStoreFactory getCrawlDataStoreFactory()
ICrawlerEventListener[] getCrawlerListeners()
ImporterConfig getImporterConfig()
ICommitter getCommitter()
IReferenceFilter[] getReferenceFilters()
IDocumentFilter[] getDocumentFilters()
IMetadataFilter[] getMetadataFilters()
IDocumentChecksummer getDocumentChecksummer()
ISpoiledReferenceStrategizer getSpoiledReferenceStrategizer()
Copyright © 2014–2021 Norconex Inc. All rights reserved.