All Classes and Interfaces

Class
Description
Abstract implementation of IDocumentChecksummer giving the option to keep the generated checksum in a metadata field.
Abstract implementation of IMetadataChecksummer giving the option to keep the generated checksum.
Base IPipelineStage context for collector Pipelines.
Base class for subcommands.
Checksum stage utility methods.
Checksum utility methods.
Clean the Collector crawling history.
Base implementation of a Collector.
Encapsulates command line arguments when running the Collector from a command prompt.
Launches a collector implementation from a string array representing command line arguments.
Base Collector configuration.
A crawler event.
Runtime exception for most unrecoverable issues thrown by Collector classes.
Collector event listener adapter for collector startup/shutdown.
Exception thrown when a problem occurred while trying to stop a collector.
Common pipeline stage for committing documents.
Validate configuration file format and quit.
Resolves all includes and variable substitutions and prints the resulting configuration to facilitate sharing.
A crawl document, which holds an additional DocInfo from cache (if any).
Metadata constants for common metadata field names typically set by a collector crawler.
Abstract crawler implementation providing a common base to building crawlers.
Wrapper around multiple Committers so they can all be handled as one.
Base Crawler configuration.
HTTP Crawler configuration loader.
A crawler event.
Listener adapter for crawler events.
Reference processing status.
Crawl data store runtime exception.
Exports data stores to a format that can be imported back to the same or different store implementation.
Imports from a previously exported data store.
Provides the ability to send deletion requests to your configured committer(s) whenever a reference is rejected, regardless of whether it was encountered in a previous crawling session.
An IPipelineStage context for collector Pipelines dealing with a CrawlDocInfo (e.g. document queuing).
Common pipeline stage for creating a document checksum.
IPipelineStage context for collector Pipelines dealing with a Doc.
Filters a reference based on a comma-separated list of extensions.
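A filter like the one above can be pictured as a simple extension lookup. The following is an illustrative sketch only, with made-up names, not the library's actual class or method signatures:

```java
// Sketch of accepting a reference only when its file extension
// appears in a comma-separated list (hypothetical names throughout).
public class ExtensionFilterSketch {

    // Returns true if the reference's extension is in the CSV list.
    public static boolean accepts(String reference, String csvExtensions) {
        // Strip any query string or fragment before inspecting the path.
        String path = reference.split("[?#]", 2)[0];
        int slash = path.lastIndexOf('/');
        String name = slash >= 0 ? path.substring(slash + 1) : path;
        int dot = name.lastIndexOf('.');
        if (dot < 0) {
            return false; // no extension to match against
        }
        String ext = name.substring(dot + 1).toLowerCase();
        for (String allowed : csvExtensions.split(",")) {
            if (allowed.trim().toLowerCase().equals(ext)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://example.com/doc.pdf", "pdf,html")); // true
        System.out.println(accepts("http://example.com/run.exe", "pdf,html")); // false
    }
}
```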
Listens for STOP requests using a stop file.
Generic implementation of IMetadataChecksummer that uses specified field names and their values to create a checksum.
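The idea behind a field-based metadata checksummer can be sketched as combining the chosen field names and values into one string and hashing it. This is a conceptual sketch with hypothetical names, not the class's real API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Sketch of building a checksum from selected metadata fields.
public class FieldChecksumSketch {

    // Builds a stable checksum from the given fields. Sorting by
    // field name makes the result independent of insertion order.
    public static String checksum(Map<String, String> fields) {
        StringBuilder combined = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
            combined.append(e.getKey()).append('=')
                    .append(e.getValue()).append(';');
        }
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(
                    combined.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Two documents with identical values in the selected fields then yield the same checksum, which is what lets a crawler skip unmodified documents.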
Generic implementation of ISpoiledReferenceStrategizer that offers a simple mapping between the crawl state of references that have turned "bad" and the strategy to adopt for each.
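Such a strategizer amounts to a lookup table from crawl state to strategy, with a fallback default. The state strings and strategy names below are examples chosen for illustration, not necessarily the library's exact constants:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of mapping "bad" crawl states to a handling strategy.
public class SpoiledStrategySketch {

    public enum Strategy { DELETE, GRACE_ONCE, IGNORE }

    private final Map<String, Strategy> mappings = new HashMap<>();
    private final Strategy fallback;

    public SpoiledStrategySketch(Strategy fallback) {
        this.fallback = fallback;
    }

    public void map(String crawlState, Strategy strategy) {
        mappings.put(crawlState, strategy);
    }

    // Resolves the strategy for a state, using the fallback when
    // no explicit mapping exists.
    public Strategy resolve(String crawlState) {
        return mappings.getOrDefault(crawlState, fallback);
    }
}
```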
Responsible for shutting down a Collector upon explicit invocation of ICollectorStopper.fireStopRequest(Collector) or when specific conditions are met.
Creates a checksum representing a document.
Filter a document after the document content is fetched, downloaded, or otherwise read or acquired.
Creates a checksum representing a document based on document metadata values obtained prior to fetching that document (e.g.
Filter a reference based on the metadata that could be obtained for a document, before it was fetched, downloaded, or otherwise read or acquired (e.g.
IPipelineStage context for collector Pipelines dealing with ImporterResponse.
Common pipeline stage for importing documents.
Filter a document based on its reference, before its properties or content gets read or otherwise acquired.
Decides which strategy to adopt for a given reference with a bad state.
Data store engine using a JDBC-compatible database for storing crawl data.
Implementation of IDocumentChecksummer that returns an MD5 checksum of the extracted document content unless one or more source fields are specified, in which case the MD5 checksum is constructed from those fields.
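The content-based default described above boils down to hashing the extracted text with the JDK's MessageDigest. This is a self-contained sketch of that idea, not the checksummer's actual code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of producing an MD5 hex checksum from document content.
public class Md5ContentChecksum {

    public static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Same content yields the same checksum; any change alters it.
        System.out.println(md5Hex("abc")); // 900150983cd24fb0d6963f7d28e17f72
    }
}
```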
Utility methods to simplify adding Mapped Diagnostic Context (MDC) to logging in a consistent way for crawlers and collectors, also offering filename-friendly variants.
Accepts or rejects a reference based on whether one or more metadata field values match.
Data store engine using MongoDB for storing crawl data.
MVStore configuration parameters.
Common pipeline stage for queuing documents.
Filters URLs based on a regular expression.
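A regex-based reference filter can be pictured as matching the whole URL against a pattern, with an accept-or-reject-on-match flag. The names and the "accept on match" semantics below are assumptions made for this sketch, not the filter's documented API:

```java
import java.util.regex.Pattern;

// Sketch of accepting or rejecting a URL with a regular expression.
public class RegexFilterSketch {

    private final Pattern pattern;
    private final boolean acceptOnMatch;

    public RegexFilterSketch(String regex, boolean acceptOnMatch) {
        this.pattern = Pattern.compile(regex);
        this.acceptOnMatch = acceptOnMatch;
    }

    // Returns true if the reference passes the filter.
    public boolean accepts(String reference) {
        boolean matches = pattern.matcher(reference).matches();
        return acceptOnMatch == matches;
    }
}
```

With `acceptOnMatch` set to false, the same pattern works as an exclusion rule (e.g. rejecting image URLs).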
Common pipeline stage for filtering references.
Reference-filtering stage utility methods.
Deprecated.
Since 2.0.0, use MetadataFilter instead.
Deprecated.
Since 2.0.0, use ReferenceFilter instead.
Common pipeline stage for saving documents.
Markers indicating what to do with references that were once processed properly but failed to reach a good processing state on a subsequent crawl.
Start the Collector.
Stop the Collector.
Alternative to CrawlerConfig.setMaxDocuments(int) for stopping the crawler upon reaching specific event counts.
Export crawl store to specified file.
Import crawl store from specified file.