public abstract class AbstractCrawler extends AbstractResumableJob implements ICrawler
Abstract crawler implementation providing a common base for building crawlers.
As of 1.6.1, JMX support is disabled by default. To enable it,
set the system property "enableJMX" to true. You can do so
by adding this to your Java launch command:
-DenableJMX=true
Modifier and Type | Class and Description |
---|---|
class |
AbstractCrawler.CopyIfNullBeanUtilsBean |
Constructor and Description |
---|
AbstractCrawler(ICrawlerConfig config)
Constructor.
|
Modifier and Type | Method and Description |
---|---|
protected void |
beforeFinalizeDocumentProcessing(BaseCrawlData crawlData,
ICrawlDataStore store,
ImporterDocument doc,
ICrawlData cachedCrawlData)
Gives implementors a chance to take action on a document before
its processing is being finalized (cycle end-of-life for a crawled
reference).
|
protected abstract void |
cleanupExecution(JobStatusUpdater statusUpdater,
JobSuite suite,
ICrawlDataStore refStore) |
protected ICrawlDataStore |
createCrawlDataStore(boolean resume) |
protected abstract BaseCrawlData |
createEmbeddedCrawlData(String embeddedReference,
ICrawlData parentCrawlData) |
protected void |
deleteCacheOrphans(ICrawlDataStore crawlDataStore,
JobStatusUpdater statusUpdater,
JobSuite suite) |
protected void |
execute(JobStatusUpdater statusUpdater,
JobSuite suite,
ICrawlDataStore crawlDataStore) |
protected abstract void |
executeCommitterPipeline(ICrawler crawler,
ImporterDocument doc,
ICrawlDataStore crawlDataStore,
BaseCrawlData crawlData,
BaseCrawlData cachedCrawlData) |
protected abstract ImporterResponse |
executeImporterPipeline(ImporterPipelineContext context) |
protected abstract void |
executeQueuePipeline(ICrawlData crawlData,
ICrawlDataStore crawlDataStore) |
void |
fireCrawlerEvent(String eventType,
ICrawlData crawlData,
Object subject) |
File |
getBaseDownloadDir() |
ICrawlerConfig |
getCrawlerConfig()
Gets the crawler configuration.
|
File |
getCrawlerDownloadDir() |
CrawlerEventManager |
getCrawlerEventManager()
Gets the crawler events manager.
|
String |
getId() |
Importer |
getImporter()
Gets the crawler Importer module.
|
CachedStreamFactory |
getStreamFactory() |
protected void |
handleOrphans(ICrawlDataStore crawlStore,
JobStatusUpdater statusUpdater,
JobSuite suite) |
protected void |
initCrawlData(ICrawlData crawlData,
ICrawlData cachedCrawlData,
ImporterDocument document) |
protected boolean |
isMaxDocuments() |
boolean |
isStopped()
Whether the crawler job was stopped.
|
protected abstract void |
markReferenceVariationsAsProcessed(BaseCrawlData crawlData,
ICrawlDataStore refStore) |
protected abstract void |
prepareExecution(JobStatusUpdater statusUpdater,
JobSuite suite,
ICrawlDataStore refStore,
boolean resume) |
protected boolean |
processNextReference(JobStatusUpdater statusUpdater,
ImporterPipelineContext context) |
protected void |
processReferences(JobStatusUpdater statusUpdater,
JobSuite suite,
ImporterPipelineContext contextPrototype) |
protected void |
reprocessCacheOrphans(ICrawlDataStore crawlDataStore,
JobStatusUpdater statusUpdater,
JobSuite suite) |
protected void |
resumeExecution(JobStatusUpdater statusUpdater,
JobSuite suite) |
protected void |
startExecution(JobStatusUpdater statusUpdater,
JobSuite suite) |
void |
stop(IJobStatus jobStatus,
JobSuite suite) |
protected abstract ImporterDocument |
wrapDocument(ICrawlData crawlData,
ImporterDocument document) |
execute
public AbstractCrawler(ICrawlerConfig config)
config
- crawler configuration

public boolean isStopped()
true
if stopped

public void stop(IJobStatus jobStatus, JobSuite suite)
public Importer getImporter()
ICrawler
getImporter
in interface ICrawler
public CachedStreamFactory getStreamFactory()
public ICrawlerConfig getCrawlerConfig()
getCrawlerConfig
in interface ICrawler
public void fireCrawlerEvent(String eventType, ICrawlData crawlData, Object subject)
public File getBaseDownloadDir()
public File getCrawlerDownloadDir()
public CrawlerEventManager getCrawlerEventManager()
ICrawler
getCrawlerEventManager
in interface ICrawler
protected void startExecution(JobStatusUpdater statusUpdater, JobSuite suite)
startExecution
in class AbstractResumableJob
protected void resumeExecution(JobStatusUpdater statusUpdater, JobSuite suite)
resumeExecution
in class AbstractResumableJob
protected ICrawlDataStore createCrawlDataStore(boolean resume)
protected abstract void prepareExecution(JobStatusUpdater statusUpdater, JobSuite suite, ICrawlDataStore refStore, boolean resume)
protected abstract void cleanupExecution(JobStatusUpdater statusUpdater, JobSuite suite, ICrawlDataStore refStore)
protected void execute(JobStatusUpdater statusUpdater, JobSuite suite, ICrawlDataStore crawlDataStore)
protected void handleOrphans(ICrawlDataStore crawlStore, JobStatusUpdater statusUpdater, JobSuite suite)
protected boolean isMaxDocuments()
protected void reprocessCacheOrphans(ICrawlDataStore crawlDataStore, JobStatusUpdater statusUpdater, JobSuite suite)
protected abstract void executeQueuePipeline(ICrawlData crawlData, ICrawlDataStore crawlDataStore)
protected void deleteCacheOrphans(ICrawlDataStore crawlDataStore, JobStatusUpdater statusUpdater, JobSuite suite)
protected void processReferences(JobStatusUpdater statusUpdater, JobSuite suite, ImporterPipelineContext contextPrototype)
protected boolean processNextReference(JobStatusUpdater statusUpdater, ImporterPipelineContext context)
protected abstract ImporterDocument wrapDocument(ICrawlData crawlData, ImporterDocument document)
protected void initCrawlData(ICrawlData crawlData, ICrawlData cachedCrawlData, ImporterDocument document)
protected void beforeFinalizeDocumentProcessing(BaseCrawlData crawlData, ICrawlDataStore store, ImporterDocument doc, ICrawlData cachedCrawlData)
crawlData
- crawl data with data the crawler was able to obtain,
guaranteed to have a non-null state

store
- crawl store

doc
- the document

cachedCrawlData
- cached crawl data (null if document was not crawled before)

protected abstract void markReferenceVariationsAsProcessed(BaseCrawlData crawlData, ICrawlDataStore refStore)
protected abstract BaseCrawlData createEmbeddedCrawlData(String embeddedReference, ICrawlData parentCrawlData)
protected abstract ImporterResponse executeImporterPipeline(ImporterPipelineContext context)
protected abstract void executeCommitterPipeline(ICrawler crawler, ImporterDocument doc, ICrawlDataStore crawlDataStore, BaseCrawlData crawlData, BaseCrawlData cachedCrawlData)
Copyright © 2014–2021 Norconex Inc. All rights reserved.