Class HttpCrawler
- java.lang.Object
  - com.norconex.collector.core.crawler.Crawler
    - com.norconex.collector.http.crawler.HttpCrawler
public class HttpCrawler extends Crawler
The HTTP Crawler.
- Author: Pascal Essiembre
- See Also: HttpCrawlerConfig
Nested Class Summary
Nested classes/interfaces inherited from class com.norconex.collector.core.crawler.Crawler
- Crawler.ReferenceProcessStatus
Constructor Summary
- public HttpCrawler(HttpCrawlerConfig crawlerConfig, HttpCollector collector): Constructor.
Method Summary
- protected void afterCrawlerExecution()
- protected void beforeCrawlerExecution(boolean resume)
- protected void beforeFinalizeDocumentProcessing(CrawlDoc doc)
- protected CrawlDocInfo createChildDocInfo(String embeddedReference, CrawlDocInfo parentCrawlData)
- protected void executeCommitterPipeline(Crawler crawler, CrawlDoc doc)
- protected ImporterResponse executeImporterPipeline(ImporterPipelineContext importerContext)
- protected void executeQueuePipeline(CrawlDocInfo crawlRef)
- protected Class<? extends CrawlDocInfo> getCrawlDocInfoType()
- public HttpCrawlerConfig getCrawlerConfig()
- public IDataStore<String> getDedupDocumentStore()
- public IDataStore<String> getDedupMetadataStore()
- public HttpFetchClient getHttpFetchClient()
- public ISitemapResolver getSitemapResolver()
- protected void initCrawlDoc(CrawlDoc doc)
- protected boolean isQueueInitialized()
- protected void markReferenceVariationsAsProcessed(CrawlDocInfo crawlRef)
Methods inherited from class com.norconex.collector.core.crawler.Crawler
clean, deleteCacheOrphans, destroyCrawler, doExecute, exportDataStore, getCollector, getCommitterService, getDataStoreEngine, getDocInfoService, getDownloadDir, getEventManager, getId, getImporter, getMonitor, getStreamFactory, getTempDir, getWorkDir, handleOrphans, importDataStore, initCrawler, isMaxDocuments, isStopped, processNextReference, processReferences, reprocessCacheOrphans, start, stop, toString
Constructor Detail
-
HttpCrawler
public HttpCrawler(HttpCrawlerConfig crawlerConfig, HttpCollector collector)
Constructor.
- Parameters:
- crawlerConfig - HTTP crawler configuration
- collector - HTTP collector this crawler belongs to
-
Method Detail
-
getCrawlerConfig
public HttpCrawlerConfig getCrawlerConfig()
- Overrides: getCrawlerConfig in class Crawler
-
getHttpFetchClient
public HttpFetchClient getHttpFetchClient()
-
getSitemapResolver
public ISitemapResolver getSitemapResolver()
- Returns: the sitemap resolver
-
getDedupMetadataStore
public IDataStore<String> getDedupMetadataStore()
-
getDedupDocumentStore
public IDataStore<String> getDedupDocumentStore()
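The two getters above expose String-valued stores whose names suggest they support duplicate detection, one on metadata and one on document content. This page does not say what the stores hold, so the following is only a sketch. It assumes each store maps a checksum to the first reference seen with that checksum. `SketchDedupStore` and `registerOrFindDuplicate` are hypothetical names, and the class is a plain in-memory stand-in, not the `IDataStore` interface.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory stand-in for a String-valued store.
// It is NOT the Norconex IDataStore interface.
class SketchDedupStore {
    // Assumption: key = checksum, value = first reference seen with it.
    private final Map<String, String> checksumToReference = new HashMap<>();

    /**
     * Records the checksum for a reference unless a different reference
     * already owns it. Returns the earlier reference when a duplicate is
     * detected, or empty when the checksum is new or already belongs to
     * the same reference.
     */
    Optional<String> registerOrFindDuplicate(String checksum, String reference) {
        String existing = checksumToReference.putIfAbsent(checksum, reference);
        if (existing == null || existing.equals(reference)) {
            return Optional.empty();
        }
        return Optional.of(existing);
    }
}
```

Under this assumption, a crawler would consult one such store with a metadata checksum and another with a content checksum, rejecting a document when either lookup returns an earlier reference.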
-
isQueueInitialized
protected boolean isQueueInitialized()
- Overrides: isQueueInitialized in class Crawler
-
beforeCrawlerExecution
protected void beforeCrawlerExecution(boolean resume)
- Specified by: beforeCrawlerExecution in class Crawler
-
afterCrawlerExecution
protected void afterCrawlerExecution()
- Specified by: afterCrawlerExecution in class Crawler
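beforeCrawlerExecution and afterCrawlerExecution are specified by the parent class and implemented here, which is the usual shape of the template-method pattern. The sketch below illustrates that pattern with hypothetical stand-in classes (`SketchCrawler`, `SketchHttpCrawler`). It is not the Norconex implementation, and the order in which the real `Crawler` invokes its hooks is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a base crawler; NOT the Norconex Crawler class.
abstract class SketchCrawler {
    protected final List<String> events = new ArrayList<>();

    // Template method: the base class fixes the order of the hooks.
    // Assumption: "before" runs first, then the crawl, then "after".
    public final void start(boolean resume) {
        beforeCrawlerExecution(resume);
        events.add("crawl");
        afterCrawlerExecution();
    }

    // Hooks left abstract so each protocol-specific subclass supplies them.
    protected abstract void beforeCrawlerExecution(boolean resume);

    protected abstract void afterCrawlerExecution();

    public List<String> getEvents() {
        return events;
    }
}

// Hypothetical subclass playing the role HttpCrawler plays for Crawler.
class SketchHttpCrawler extends SketchCrawler {
    @Override
    protected void beforeCrawlerExecution(boolean resume) {
        // A real crawler might open HTTP clients or stores here.
        events.add(resume ? "before(resume)" : "before(fresh)");
    }

    @Override
    protected void afterCrawlerExecution() {
        // A real crawler might release those resources here.
        events.add("after");
    }
}
```

Calling `new SketchHttpCrawler().start(false)` records the three events in the order the base class dictates. The subclass controls what each hook does but not when it runs.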
-
executeQueuePipeline
protected void executeQueuePipeline(CrawlDocInfo crawlRef)
- Specified by: executeQueuePipeline in class Crawler
-
getCrawlDocInfoType
protected Class<? extends CrawlDocInfo> getCrawlDocInfoType()
- Overrides: getCrawlDocInfoType in class Crawler
-
initCrawlDoc
protected void initCrawlDoc(CrawlDoc doc)
- Overrides: initCrawlDoc in class Crawler
-
executeImporterPipeline
protected ImporterResponse executeImporterPipeline(ImporterPipelineContext importerContext)
- Specified by: executeImporterPipeline in class Crawler
-
createChildDocInfo
protected CrawlDocInfo createChildDocInfo(String embeddedReference, CrawlDocInfo parentCrawlData)
- Specified by: createChildDocInfo in class Crawler
-
executeCommitterPipeline
protected void executeCommitterPipeline(Crawler crawler, CrawlDoc doc)
- Specified by: executeCommitterPipeline in class Crawler
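The three pipeline hooks (executeQueuePipeline, executeImporterPipeline, executeCommitterPipeline) suggest a reference passes through a queue stage, then an importer stage, then a committer stage. This page does not document that flow, so the sketch below is only one plausible reading. `SketchPipelineCrawler` is a hypothetical class, plain strings stand in for `CrawlDocInfo`, `ImporterResponse` and `CrawlDoc`, and the filters are invented purely for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in showing one possible ordering of the three
// pipeline hooks. Strings replace the real Norconex types.
class SketchPipelineCrawler {
    private final Deque<String> queue = new ArrayDeque<>();
    private final List<String> committed = new ArrayList<>();

    // Queue stage: accept a reference only if it passes a filter
    // (illustrative rule: the reference must start with "http").
    void executeQueuePipeline(String reference) {
        if (reference.startsWith("http")) {
            queue.add(reference);
        }
    }

    // Importer stage: turn a reference into a document, or return null
    // when rejected (illustrative rule: ".pdf" references are rejected).
    String executeImporterPipeline(String reference) {
        return reference.endsWith(".pdf") ? null : "doc:" + reference;
    }

    // Committer stage: hand an imported document to its destination.
    void executeCommitterPipeline(String doc) {
        committed.add(doc);
    }

    // Drains the queue, importing then committing each accepted reference.
    List<String> crawl() {
        String reference;
        while ((reference = queue.poll()) != null) {
            String doc = executeImporterPipeline(reference);
            if (doc != null) {
                executeCommitterPipeline(doc);
            }
        }
        return committed;
    }
}
```

In this reading, a document reaches the committer only if both earlier stages accepted it. That is a design assumption here, not something this page confirms.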
-
beforeFinalizeDocumentProcessing
protected void beforeFinalizeDocumentProcessing(CrawlDoc doc)
- Overrides: beforeFinalizeDocumentProcessing in class Crawler
-
markReferenceVariationsAsProcessed
protected void markReferenceVariationsAsProcessed(CrawlDocInfo crawlRef)
- Specified by: markReferenceVariationsAsProcessed in class Crawler