Class HttpCrawler
- java.lang.Object
  - com.norconex.collector.core.crawler.Crawler
    - com.norconex.collector.http.crawler.HttpCrawler

public class HttpCrawler extends Crawler
The HTTP Crawler.
- Author:
Pascal Essiembre
- See Also:
HttpCrawlerConfig
-
Nested Class Summary
-
Nested classes/interfaces inherited from class com.norconex.collector.core.crawler.Crawler
Crawler.ReferenceProcessStatus
-
-
Constructor Summary
Constructor                                                              Description
HttpCrawler(HttpCrawlerConfig crawlerConfig, HttpCollector collector)    Constructor.
-
Method Summary
Modifier and Type                        Method
protected void                           afterCrawlerExecution()
protected void                           beforeCrawlerExecution(boolean resume)
protected void                           beforeFinalizeDocumentProcessing(CrawlDoc doc)
protected CrawlDocInfo                   createChildDocInfo(String embeddedReference, CrawlDocInfo parentCrawlData)
protected void                           executeCommitterPipeline(Crawler crawler, CrawlDoc doc)
protected ImporterResponse               executeImporterPipeline(ImporterPipelineContext importerContext)
protected void                           executeQueuePipeline(CrawlDocInfo crawlRef)
protected Class<? extends CrawlDocInfo>  getCrawlDocInfoType()
HttpCrawlerConfig                        getCrawlerConfig()
IDataStore<String>                       getDedupDocumentStore()
IDataStore<String>                       getDedupMetadataStore()
HttpFetchClient                          getHttpFetchClient()
ISitemapResolver                         getSitemapResolver()
protected void                           initCrawlDoc(CrawlDoc doc)
protected boolean                        isQueueInitialized()
protected void                           markReferenceVariationsAsProcessed(CrawlDocInfo crawlRef)
-
Methods inherited from class com.norconex.collector.core.crawler.Crawler
clean, deleteCacheOrphans, destroyCrawler, doExecute, exportDataStore, getCollector, getCommitterService, getDataStoreEngine, getDocInfoService, getDownloadDir, getEventManager, getId, getImporter, getMonitor, getStreamFactory, getTempDir, getWorkDir, handleOrphans, importDataStore, initCrawler, isMaxDocuments, isStopped, processNextReference, processReferences, reprocessCacheOrphans, start, stop, toString
-
Constructor Detail
-
HttpCrawler
public HttpCrawler(HttpCrawlerConfig crawlerConfig, HttpCollector collector)
Constructor.
- Parameters:
crawlerConfig - HTTP crawler configuration
collector - HTTP collector this crawler belongs to
-
-
Method Detail
-
getCrawlerConfig
public HttpCrawlerConfig getCrawlerConfig()
- Overrides:
getCrawlerConfig in class Crawler
-
getHttpFetchClient
public HttpFetchClient getHttpFetchClient()
-
getSitemapResolver
public ISitemapResolver getSitemapResolver()
- Returns:
the sitemap resolver
-
getDedupMetadataStore
public IDataStore<String> getDedupMetadataStore()
-
getDedupDocumentStore
public IDataStore<String> getDedupDocumentStore()
-
isQueueInitialized
protected boolean isQueueInitialized()
- Overrides:
isQueueInitialized in class Crawler
-
beforeCrawlerExecution
protected void beforeCrawlerExecution(boolean resume)
- Specified by:
beforeCrawlerExecution in class Crawler
-
afterCrawlerExecution
protected void afterCrawlerExecution()
- Specified by:
afterCrawlerExecution in class Crawler
-
executeQueuePipeline
protected void executeQueuePipeline(CrawlDocInfo crawlRef)
- Specified by:
executeQueuePipeline in class Crawler
-
getCrawlDocInfoType
protected Class<? extends CrawlDocInfo> getCrawlDocInfoType()
- Overrides:
getCrawlDocInfoType in class Crawler
-
initCrawlDoc
protected void initCrawlDoc(CrawlDoc doc)
- Overrides:
initCrawlDoc in class Crawler
-
executeImporterPipeline
protected ImporterResponse executeImporterPipeline(ImporterPipelineContext importerContext)
- Specified by:
executeImporterPipeline in class Crawler
-
createChildDocInfo
protected CrawlDocInfo createChildDocInfo(String embeddedReference, CrawlDocInfo parentCrawlData)
- Specified by:
createChildDocInfo in class Crawler
-
executeCommitterPipeline
protected void executeCommitterPipeline(Crawler crawler, CrawlDoc doc)
- Specified by:
executeCommitterPipeline in class Crawler
-
beforeFinalizeDocumentProcessing
protected void beforeFinalizeDocumentProcessing(CrawlDoc doc)
- Overrides:
beforeFinalizeDocumentProcessing in class Crawler
-
markReferenceVariationsAsProcessed
protected void markReferenceVariationsAsProcessed(CrawlDocInfo crawlRef)
- Specified by:
markReferenceVariationsAsProcessed in class Crawler
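
Several methods above are marked "Specified by", meaning the parent Crawler declares them abstract and HttpCrawler supplies the implementation. The following is a minimal, self-contained sketch of that template-method arrangement. It does not use the Norconex classes: SketchCrawler, SketchHttpCrawler, run, doWork and demo are hypothetical stand-ins. Only the hook names beforeCrawlerExecution and afterCrawlerExecution come from this page. The order in which the real Crawler calls its hooks is not documented here, so the ordering below is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class LifecycleSketch {

    // Hypothetical stand-in for a crawler base class. NOT the Norconex
    // Crawler; it only borrows two hook names from the page above.
    abstract static class SketchCrawler {
        protected final List<String> events = new ArrayList<>();

        // Template method: the base class fixes the call order (assumed
        // here), while subclasses fill in the individual hooks.
        public final List<String> run(boolean resume) {
            beforeCrawlerExecution(resume);
            doWork();
            afterCrawlerExecution();
            return events;
        }

        // Placeholder for the actual crawling step (hypothetical).
        private void doWork() {
            events.add("work");
        }

        protected abstract void beforeCrawlerExecution(boolean resume);

        protected abstract void afterCrawlerExecution();
    }

    // Hypothetical subclass that plays the role HttpCrawler plays for Crawler.
    static class SketchHttpCrawler extends SketchCrawler {
        @Override
        protected void beforeCrawlerExecution(boolean resume) {
            events.add(resume ? "before:resume" : "before:fresh");
        }

        @Override
        protected void afterCrawlerExecution() {
            events.add("after");
        }
    }

    // Runs the sketch once and returns the recorded hook calls in order.
    public static List<String> demo(boolean resume) {
        return new SketchHttpCrawler().run(resume);
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // prints [before:fresh, work, after]
    }
}
```

Because the hooks are protected, callers of the real class would not invoke them directly. They are extension points for subclasses, and the inherited public methods such as start and stop remain the entry points.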