public abstract class Crawler extends Object
Abstract crawler implementation providing a common base for building crawlers.
As of 1.6.1, JMX support is disabled by default. To enable it,
set the system property "enableJMX" to true
. You can do so
by adding this to your Java launch command:
-DenableJMX=true
CrawlerConfig
Modifier and Type | Class and Description |
---|---|
protected static class |
Crawler.ReferenceProcessStatus |
Constructor and Description |
---|
Crawler(CrawlerConfig config,
Collector collector)
Constructor.
|
Modifier and Type | Method and Description |
---|---|
protected abstract void |
afterCrawlerExecution()
Gives crawler implementations a chance to do something right after
the crawler is done processing its last reference, before all resources
are shut down.
|
protected abstract void |
beforeCrawlerExecution(boolean resume)
Gives crawler implementations a chance to prepare before execution starts.
Invoked right after the
CrawlerEvent.CRAWLER_RUN_BEGIN is fired. |
protected void |
beforeFinalizeDocumentProcessing(CrawlDoc doc)
Gives implementors a chance to take action on a document before
its processing is being finalized (cycle end-of-life for a crawled
reference).
|
void |
clean() |
protected abstract CrawlDocInfo |
createChildDocInfo(String embeddedReference,
CrawlDocInfo parentCrawlRef) |
protected void |
deleteCacheOrphans() |
protected void |
destroyCrawler() |
protected void |
doExecute() |
protected abstract void |
executeCommitterPipeline(Crawler crawler,
CrawlDoc doc) |
protected abstract ImporterResponse |
executeImporterPipeline(ImporterPipelineContext context) |
protected abstract void |
executeQueuePipeline(CrawlDocInfo ref) |
Path |
exportDataStore(Path dir) |
Collector |
getCollector() |
CrawlerCommitterService |
getCommitterService() |
protected Class<? extends CrawlDocInfo> |
getCrawlDocInfoType() |
CrawlerConfig |
getCrawlerConfig()
Gets the crawler configuration.
|
IDataStoreEngine |
getDataStoreEngine() |
CrawlDocInfoService |
getDocInfoService() |
Path |
getDownloadDir() |
EventManager |
getEventManager()
Gets the event manager.
|
String |
getId() |
Importer |
getImporter()
Gets the crawler Importer module.
|
CrawlerMonitor |
getMonitor() |
CachedStreamFactory |
getStreamFactory() |
Path |
getTempDir()
Gets the directory where most temporary files are created for the
duration of a crawling session.
|
Path |
getWorkDir()
Gets the directory where files needing to be persisted between
crawling sessions are kept.
|
protected void |
handleOrphans() |
void |
importDataStore(Path inFile) |
protected void |
initCrawlDoc(CrawlDoc document) |
protected boolean |
initCrawler() |
protected boolean |
isMaxDocuments() |
protected boolean |
isQueueInitialized() |
boolean |
isStopped()
Whether the crawler job was stopped.
|
protected abstract void |
markReferenceVariationsAsProcessed(CrawlDocInfo crawlRef) |
protected Crawler.ReferenceProcessStatus |
processNextReference(com.norconex.collector.core.crawler.Crawler.ProcessFlags flags) |
protected void |
processReferences(com.norconex.collector.core.crawler.Crawler.ProcessFlags flags) |
protected void |
reprocessCacheOrphans() |
void |
start()
Starts crawling.
|
void |
stop() |
String |
toString() |
public Crawler(CrawlerConfig config, Collector collector)
config
- crawler configuration
collector
- the collector this crawler is attached to
public EventManager getEventManager()
public CrawlerMonitor getMonitor()
public CrawlerCommitterService getCommitterService()
public String getId()
public boolean isStopped()
true
if stopped
public void stop()
public Importer getImporter()
public CachedStreamFactory getStreamFactory()
public CrawlerConfig getCrawlerConfig()
public Collector getCollector()
public Path getWorkDir()
null
public Path getTempDir()
null
public Path getDownloadDir()
public void start()
protected boolean initCrawler()
protected Class<? extends CrawlDocInfo> getCrawlDocInfoType()
public IDataStoreEngine getDataStoreEngine()
public CrawlDocInfoService getDocInfoService()
public void clean()
public void importDataStore(Path inFile)
protected void destroyCrawler()
protected abstract void beforeCrawlerExecution(boolean resume)
CrawlerEvent.CRAWLER_RUN_BEGIN
is fired.
This method is different than the initCrawler()
method, which
is invoked for any type of actions whereas this one is only invoked
before an effective request for crawling.
resume
- whether the crawl is resuming from an unfinished session.
protected abstract void afterCrawlerExecution()
CrawlerEvent.CRAWLER_STOP_END
or
CrawlerEvent.CRAWLER_RUN_END
(depending which of the two is
triggered).
protected void doExecute()
protected void handleOrphans()
protected boolean isMaxDocuments()
protected void reprocessCacheOrphans()
protected abstract void executeQueuePipeline(CrawlDocInfo ref)
protected void deleteCacheOrphans()
protected void processReferences(com.norconex.collector.core.crawler.Crawler.ProcessFlags flags)
protected Crawler.ReferenceProcessStatus processNextReference(com.norconex.collector.core.crawler.Crawler.ProcessFlags flags)
protected void initCrawlDoc(CrawlDoc document)
protected void beforeFinalizeDocumentProcessing(CrawlDoc doc)
doc
- the document
protected abstract void markReferenceVariationsAsProcessed(CrawlDocInfo crawlRef)
protected abstract CrawlDocInfo createChildDocInfo(String embeddedReference, CrawlDocInfo parentCrawlRef)
protected abstract ImporterResponse executeImporterPipeline(ImporterPipelineContext context)
protected abstract void executeCommitterPipeline(Crawler crawler, CrawlDoc doc)
protected boolean isQueueInitialized()
Copyright © 2014–2023 Norconex Inc. All rights reserved.