public abstract class CrawlerConfig extends Object implements IXMLConfigurable
Base crawler configuration. Crawlers usually read this configuration when they start up. Do not change it once execution has started, as doing so may cause unexpected behavior.
Concrete implementations inherit the following XML configuration options (typically within a <crawler> tag):
<numThreads>(maximum number of threads)</numThreads>
<maxDocuments>(maximum number of documents to crawl)</maxDocuments>
<orphansStrategy>[PROCESS|IGNORE|DELETE]</orphansStrategy>
<stopOnExceptions>
  <!-- Repeatable -->
  <exception>(fully qualified class name of an exception)</exception>
</stopOnExceptions>
<eventListeners>
  <!-- Repeatable -->
  <listener class="(IEventListener implementation)"/>
</eventListeners>
<dataStoreEngine class="(IDataStoreEngine implementation)"/>
<referenceFilters>
  <!-- Repeatable -->
  <filter class="(IReferenceFilter implementation)" onMatch="[include|exclude]"/>
</referenceFilters>
<metadataFilters>
  <!-- Repeatable -->
  <filter class="(IMetadataFilter implementation)" onMatch="[include|exclude]"/>
</metadataFilters>
<documentFilters>
  <!-- Repeatable -->
  <filter class="(IDocumentFilter implementation)"/>
</documentFilters>
<importer>
  <preParseHandlers>
    <!-- Repeatable -->
    <handler class="(a handler class from the Importer module)"/>
  </preParseHandlers>
  <documentParserFactory class="(IDocumentParser implementation)"/>
  <postParseHandlers>
    <!-- Repeatable -->
    <handler class="(a handler class from the Importer module)"/>
  </postParseHandlers>
  <responseProcessors>
    <!-- Repeatable -->
    <responseProcessor class="(IImporterResponseProcessor implementation)"/>
  </responseProcessors>
</importer>
<metadataChecksummer class="(IMetadataChecksummer implementation)"/>
<metadataDeduplicate>[false|true]</metadataDeduplicate>
<documentChecksummer class="(IDocumentChecksummer implementation)"/>
<documentDeduplicate>[false|true]</documentDeduplicate>
<spoiledReferenceStrategizer class="(ISpoiledReferenceStrategizer implementation)"/>
<committers>
  <committer class="(ICommitter implementation)"/>
</committers>
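To make the shape of these options concrete, here is a minimal sketch that reads a few of them from a sample <crawler> fragment. It uses only the JDK's DOM parser rather than the library's own XML class, and the class name CrawlerXmlSketch, its helper methods, and the sample values are all assumptions made for illustration, not part of this API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CrawlerXmlSketch {

    // Sample fragment using only options listed above; the values are made up.
    public static final String SAMPLE =
          "<crawler>"
        + "<numThreads>4</numThreads>"
        + "<maxDocuments>1000</maxDocuments>"
        + "<orphansStrategy>PROCESS</orphansStrategy>"
        + "<stopOnExceptions>"
        + "<exception>java.io.IOException</exception>"
        + "<exception>java.lang.IllegalStateException</exception>"
        + "</stopOnExceptions>"
        + "</crawler>";

    // Parses the XML string and returns its root element.
    public static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid crawler XML", e);
        }
    }

    // Returns the trimmed text of the first matching descendant tag, or null if absent.
    public static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    // Returns the trimmed text of every matching descendant tag (for repeatable options).
    public static List<String> texts(Element parent, String tag) {
        List<String> values = new ArrayList<>();
        NodeList nodes = parent.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    public static void main(String[] args) {
        Element crawler = parse(SAMPLE);
        System.out.println("numThreads=" + Integer.parseInt(text(crawler, "numThreads")));
        System.out.println("maxDocuments=" + Integer.parseInt(text(crawler, "maxDocuments")));
        System.out.println("orphansStrategy=" + text(crawler, "orphansStrategy"));
        System.out.println("stopOnExceptions=" + texts(crawler, "exception"));
    }
}
```

In practice a concrete crawler configuration would be populated through loadFromXML(XML); the sketch only shows how the simple and repeatable elements map to values.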
Modifier and Type | Class and Description |
---|---|
static class | CrawlerConfig.OrphansStrategy |

Constructor and Description |
---|
CrawlerConfig() Creates a new crawler configuration. |
Modifier and Type | Method and Description |
---|---|
void | addEventListeners(IEventListener<?>... eventListeners) Adds event listeners. |
void | addEventListeners(List<IEventListener<?>> eventListeners) Adds event listeners. |
void | clearEventListeners() Clears all event listeners. |
boolean | equals(Object other) |
ICommitter | getCommitter() Deprecated. Since 2.0.0, use getCommitters(). |
List<ICommitter> | getCommitters() Gets Committers responsible for persisting information to a target location/repository. |
IDataStoreEngine | getDataStoreEngine() Gets the crawl data store factory. |
IDocumentChecksummer | getDocumentChecksummer() Gets the document checksummer. |
List<IDocumentFilter> | getDocumentFilters() Gets the document filters. |
List<IEventListener<?>> | getEventListeners() Gets event listeners. |
String | getId() Gets this crawler's unique identifier. |
ImporterConfig | getImporterConfig() Gets the Importer module configuration. |
int | getMaxDocuments() Gets the maximum number of documents that can be processed. |
IMetadataChecksummer | getMetadataChecksummer() Gets the metadata checksummer. |
List<IMetadataFilter> | getMetadataFilters() Gets metadata filters. |
int | getNumThreads() Gets the maximum number of threads a crawler can use. |
CrawlerConfig.OrphansStrategy | getOrphansStrategy() Gets the strategy to adopt when there are orphans. |
List<IReferenceFilter> | getReferenceFilters() Gets reference filters. |
ISpoiledReferenceStrategizer | getSpoiledReferenceStrategizer() Gets the spoiled state strategy resolver. |
List<Class<? extends Exception>> | getStopOnExceptions() Gets the exceptions we want to stop the crawler on. |
int | hashCode() |
boolean | isDocumentDeduplicate() Gets whether to turn on deduplication based on document checksum. |
boolean | isMetadataDeduplicate() Gets whether to turn on deduplication based on metadata checksum. |
protected abstract void | loadCrawlerConfigFromXML(XML xml) |
void | loadFromXML(XML xml) |
protected abstract void | saveCrawlerConfigToXML(XML xml) |
void | saveToXML(XML xml) |
void | setCommitter(ICommitter committer) Deprecated. Since 2.0.0, use setCommitters(ICommitter...). |
void | setCommitters(ICommitter... committers) Sets Committers responsible for persisting information to a target location/repository. |
void | setCommitters(List<ICommitter> committers) Sets Committers responsible for persisting information to a target location/repository. |
void | setDataStoreEngine(IDataStoreEngine dataStoreEngine) Sets the crawl data store factory. |
void | setDocumentChecksummer(IDocumentChecksummer documentChecksummer) Sets the document checksummer. |
void | setDocumentDeduplicate(boolean documentDeduplicate) Sets whether to turn on deduplication based on document checksum. |
void | setDocumentFilters(IDocumentFilter... documentFilters) Sets document filters. |
void | setDocumentFilters(List<IDocumentFilter> documentFilters) Sets document filters. |
void | setEventListeners(IEventListener<?>... eventListeners) Sets event listeners. |
void | setEventListeners(List<IEventListener<?>> eventListeners) Sets event listeners. |
void | setId(String id) Sets this crawler's unique identifier. |
void | setImporterConfig(ImporterConfig importerConfig) Sets the Importer module configuration. |
void | setMaxDocuments(int maxDocuments) Sets the maximum number of documents that can be processed. |
void | setMetadataChecksummer(IMetadataChecksummer metadataChecksummer) Sets the metadata checksummer. |
void | setMetadataDeduplicate(boolean metadataDeduplicate) Sets whether to turn on deduplication based on metadata checksum. |
void | setMetadataFilters(IMetadataFilter... metadataFilters) Sets metadata filters. |
void | setMetadataFilters(List<IMetadataFilter> metadataFilters) Sets metadata filters. |
void | setNumThreads(int numThreads) Sets the maximum number of threads a crawler can use. |
void | setOrphansStrategy(CrawlerConfig.OrphansStrategy orphansStrategy) Sets the strategy to adopt when there are orphans. |
void | setReferenceFilters(IReferenceFilter... referenceFilters) Sets reference filters. |
void | setReferenceFilters(List<IReferenceFilter> referenceFilters) Sets reference filters. |
void | setSpoiledReferenceStrategizer(ISpoiledReferenceStrategizer spoiledReferenceStrategizer) Sets the spoiled state strategy resolver. |
void | setStopOnExceptions(Class<? extends Exception>... stopOnExceptions) Sets the exceptions we want to stop the crawler on. |
void | setStopOnExceptions(List<Class<? extends Exception>> stopOnExceptions) Sets the exceptions we want to stop the crawler on. |
String | toString() |
public String getId()
public void setId(String id)
id - unique identifier
public int getNumThreads()
public void setNumThreads(int numThreads)
numThreads - number of threads
public int getMaxDocuments()
public void setMaxDocuments(int maxDocuments)
maxDocuments - maximum number of documents that can be processed
public CrawlerConfig.OrphansStrategy getOrphansStrategy()
Gets the strategy to adopt when there are orphans. Orphans are references that were processed in a previous run but not in the current one. In other words, they are leftovers from a previous run that were not re-encountered in the current run.
Unless explicitly stated otherwise by an implementing class, the default strategy is to PROCESS orphans (this has been the default since 1.2.0). Setting a null value is the same as setting IGNORE.
Be careful: Setting the orphan strategy to DELETE is NOT recommended in most cases. With some collectors, a temporary failure, such as a network outage or a web page timing out, may cause some documents not to be crawled. When this happens, the unreachable documents would be considered "orphans" and be deleted, whereas under normal circumstances they should be kept. Re-processing them (the default) is usually the safest approach, as it confirms they still exist before deleting or updating them.
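The orphan definition above amounts to a set difference between two runs. The following is a minimal sketch of that idea, not the crawler's actual implementation: the class OrphanSketch, its Strategy enum (a stand-in mirroring the three documented values), and the action strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class OrphanSketch {

    // Stand-in mirroring the documented values; not the library's own enum.
    public enum Strategy { PROCESS, IGNORE, DELETE }

    // Orphans: references processed in the previous run but not encountered in the current one.
    public static Set<String> findOrphans(Set<String> previousRun, Set<String> currentRun) {
        Set<String> orphans = new LinkedHashSet<>(previousRun);
        orphans.removeAll(currentRun);
        return orphans;
    }

    // Lists what would be done with each orphan. A null strategy is treated as IGNORE.
    public static List<String> plan(Set<String> orphans, Strategy strategy) {
        Strategy effective = (strategy == null) ? Strategy.IGNORE : strategy;
        List<String> actions = new ArrayList<>();
        for (String ref : orphans) {
            switch (effective) {
                case PROCESS:
                    actions.add("re-crawl " + ref); // confirm it still exists first
                    break;
                case DELETE:
                    actions.add("delete " + ref);   // risky after a temporary failure
                    break;
                default:
                    break;                          // IGNORE: leave the orphan untouched
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        Set<String> previous = new LinkedHashSet<>(Arrays.asList("a.html", "b.html", "c.html"));
        Set<String> current = new LinkedHashSet<>(Arrays.asList("a.html", "c.html"));
        Set<String> orphans = findOrphans(previous, current);
        System.out.println("orphans: " + orphans);
        System.out.println("PROCESS: " + plan(orphans, Strategy.PROCESS));
        System.out.println("DELETE:  " + plan(orphans, Strategy.DELETE));
        System.out.println("null:    " + plan(orphans, null));
    }
}
```

If b.html was missed only because of a network outage, DELETE would remove a document that still exists, whereas PROCESS would re-crawl it and find it again.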
public void setOrphansStrategy(CrawlerConfig.OrphansStrategy orphansStrategy)
Sets the strategy to adopt when there are orphans.
orphansStrategy - orphans strategy
See Also: getOrphansStrategy()
public List<Class<? extends Exception>> getStopOnExceptions()
public void setStopOnExceptions(Class<? extends Exception>... stopOnExceptions)
stopOnExceptions - exceptions that will stop the crawler when encountered
public void setStopOnExceptions(List<Class<? extends Exception>> stopOnExceptions)
stopOnExceptions - exceptions that will stop the crawler when encountered
public IDataStoreEngine getDataStoreEngine()
public void setDataStoreEngine(IDataStoreEngine dataStoreEngine)
dataStoreEngine - crawl data store factory.
public ISpoiledReferenceStrategizer getSpoiledReferenceStrategizer()
public void setSpoiledReferenceStrategizer(ISpoiledReferenceStrategizer spoiledReferenceStrategizer)
spoiledReferenceStrategizer - spoiled state strategy resolver
public List<IReferenceFilter> getReferenceFilters()
public void setReferenceFilters(IReferenceFilter... referenceFilters)
referenceFilters - reference filters to set
public void setReferenceFilters(List<IReferenceFilter> referenceFilters)
referenceFilters - the referenceFilters to set
public List<IDocumentFilter> getDocumentFilters()
public void setDocumentFilters(IDocumentFilter... documentFilters)
documentFilters - document filters
public void setDocumentFilters(List<IDocumentFilter> documentFilters)
documentFilters - document filters
public List<IMetadataFilter> getMetadataFilters()
public void setMetadataFilters(IMetadataFilter... metadataFilters)
metadataFilters - metadata filters
public void setMetadataFilters(List<IMetadataFilter> metadataFilters)
metadataFilters - metadata filters
public IMetadataChecksummer getMetadataChecksummer()
HttpCrawlerConfig.
public void setMetadataChecksummer(IMetadataChecksummer metadataChecksummer)
metadataChecksummer - metadata checksummer
HttpCrawlerConfig.
public IDocumentChecksummer getDocumentChecksummer()
public void setDocumentChecksummer(IDocumentChecksummer documentChecksummer)
documentChecksummer - document checksummer
public ImporterConfig getImporterConfig()
public void setImporterConfig(ImporterConfig importerConfig)
importerConfig - Importer module configuration
@Deprecated public ICommitter getCommitter()
Deprecated. Since 2.0.0, use getCommitters().
@Deprecated public void setCommitter(ICommitter committer)
Deprecated. Since 2.0.0, use setCommitters(ICommitter...).
committer - Committer module configuration
public List<ICommitter> getCommitters()
null)
public void setCommitters(List<ICommitter> committers)
committers - list of Committers
public void setCommitters(ICommitter... committers)
committers - list of Committers
public List<IEventListener<?>> getEventListeners()
IEventListener.
public void setEventListeners(IEventListener<?>... eventListeners)
IEventListener.
eventListeners - event listeners.
public void setEventListeners(List<IEventListener<?>> eventListeners)
IEventListener.
eventListeners - event listeners.
public void addEventListeners(IEventListener<?>... eventListeners)
IEventListener.
eventListeners - event listeners.
public void addEventListeners(List<IEventListener<?>> eventListeners)
IEventListener.
eventListeners - event listeners.
public void clearEventListeners()
IEventListener are not cleared.
public boolean isMetadataDeduplicate()
getMetadataChecksummer() returns null.
Not recommended unless you know for sure your metadata checksum is acceptably unique.
public void setMetadataDeduplicate(boolean metadataDeduplicate)
getMetadataChecksummer() returns null.
Not recommended unless you know for sure your metadata checksum is acceptably unique.
metadataDeduplicate - true to turn on metadata-based deduplication
public boolean isDocumentDeduplicate()
getDocumentChecksummer() returns null.
Not recommended unless you know for sure your document checksum is acceptably unique.
public void setDocumentDeduplicate(boolean documentDeduplicate)
getDocumentChecksummer() returns null.
Not recommended unless you know for sure your document checksum is acceptably unique.
documentDeduplicate - true to turn on document-based deduplication
public void saveToXML(XML xml)
Specified by: saveToXML in interface IXMLConfigurable
protected abstract void saveCrawlerConfigToXML(XML xml)
public final void loadFromXML(XML xml)
Specified by: loadFromXML in interface IXMLConfigurable
protected abstract void loadCrawlerConfigFromXML(XML xml)
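The signatures above (a final loadFromXML(XML) alongside an abstract loadCrawlerConfigFromXML(XML), and likewise for saving) suggest a template-method arrangement in which the base class handles the shared options and delegates crawler-specific ones to subclasses. The sketch below illustrates that pattern under this assumption only. It is not the library's code: a Map stands in for the XML argument, and BaseConfig, DemoConfig, startUrl, and the default values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TemplateMethodSketch {

    // Hypothetical base class; a Map of option names to values stands in for XML.
    public abstract static class BaseConfig {
        private int numThreads = 2;    // arbitrary default for this sketch
        private int maxDocuments = -1; // arbitrary default for this sketch

        // Final: subclasses cannot skip the loading of shared options.
        public final void load(Map<String, String> options) {
            if (options.containsKey("numThreads")) {
                numThreads = Integer.parseInt(options.get("numThreads"));
            }
            if (options.containsKey("maxDocuments")) {
                maxDocuments = Integer.parseInt(options.get("maxDocuments"));
            }
            loadSpecific(options); // hook for subclass-specific options
        }

        // Writes the shared options, then lets the subclass add its own.
        public void save(Map<String, String> options) {
            options.put("numThreads", Integer.toString(numThreads));
            options.put("maxDocuments", Integer.toString(maxDocuments));
            saveSpecific(options);
        }

        protected abstract void loadSpecific(Map<String, String> options);
        protected abstract void saveSpecific(Map<String, String> options);

        public int getNumThreads() { return numThreads; }
        public int getMaxDocuments() { return maxDocuments; }
    }

    // Hypothetical concrete configuration with one extra option.
    public static class DemoConfig extends BaseConfig {
        private String startUrl;

        @Override
        protected void loadSpecific(Map<String, String> options) {
            startUrl = options.get("startUrl");
        }

        @Override
        protected void saveSpecific(Map<String, String> options) {
            options.put("startUrl", startUrl);
        }

        public String getStartUrl() { return startUrl; }
    }

    public static void main(String[] args) {
        Map<String, String> in = new LinkedHashMap<>();
        in.put("numThreads", "4");
        in.put("startUrl", "https://example.com/");

        DemoConfig config = new DemoConfig();
        config.load(in);

        Map<String, String> out = new LinkedHashMap<>();
        config.save(out);
        System.out.println(out);
    }
}
```

The design choice this illustrates is that implementers only write the two protected hooks, while the shared options are read and written in one place.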
Copyright © 2014–2023 Norconex Inc. All rights reserved.