All Classes and Interfaces
Class
Description
Abstract implementation of IDocumentChecksummer giving the option
to keep the generated checksum in a metadata field.
Abstract implementation of IMetadataChecksummer giving the option
to keep the generated checksum.
Base IPipelineStage context for collector Pipelines.
Base class for subcommands.
Checksum stage utility methods.
Checksum utility methods.
Clean the Collector crawling history.
Base implementation of a Collector.
Encapsulates command line arguments when running the Collector from
a command prompt.
Launches a collector implementation from a string array representing
command line arguments.
Base Collector configuration.
A crawler event.
Runtime exception for most unrecoverable issues thrown by Collector
classes.
Collector event listener adapter for collector startup/shutdown.
Exception thrown when a problem occurred while trying to stop
a collector.
Common pipeline stage for committing documents.
Validate configuration file format and quit.
Resolve all includes and variables substitution and print the
resulting configuration to facilitate sharing.
A crawl document, which holds an additional DocInfo from cache (if any).
Metadata constants for common metadata field names typically set by
a collector crawler.
Abstract crawler implementation providing a common base for building
crawlers.
Wrapper around multiple Committers so they can all be handled as one.
Base Crawler configuration.
HTTP Crawler configuration loader.
A crawler event.
Listener adapter for crawler events.
Reference processing status.
Crawl data store runtime exception.
Exports data stores to a format that can be imported back to the same
or a different store implementation.
Imports from a previously exported data store.
Provides the ability to send deletion requests to your configured
committer(s) whenever a reference is rejected, regardless of whether it was
encountered in a previous crawling session or not.
An IPipelineStage context for collector Pipelines dealing
with a CrawlDocInfo (e.g. document queuing).
Common pipeline stage for creating a document checksum.
Filters a reference based on a comma-separated list of extensions.
Listens for STOP requests using a stop file.
Generic implementation of IMetadataChecksummer that uses
specified field names and their values to create a checksum.
Generic implementation of ISpoiledReferenceStrategizer that
offers a simple mapping between the crawl state of references that have
turned "bad" and the strategy to adopt for each.
Responsible for shutting down a Collector upon explicit invocation
of ICollectorStopper.fireStopRequest(Collector) or when specific
conditions are met.
Creates a checksum representing a document.
Filter a document after the document content is fetched, downloaded,
or otherwise read or acquired.
Creates a checksum representing a document based on document metadata
values obtained prior to fetching that document (e.g.
Filter a reference based on the metadata that could be obtained for a
document, before it was fetched, downloaded, or otherwise read or acquired
(e.g.
Common pipeline stage for importing documents.
Filter a document based on its reference, before its properties or content
gets read or otherwise acquired.
Decides which strategy to adopt for a given reference with a bad state.
Data store engine using a JDBC-compatible database for storing
crawl data.
Implementation of IDocumentChecksummer which
returns an MD5 checksum value of the extracted document content unless
one or more given source fields are specified, in which case the MD5
checksum value is constructed from those fields.
Utility methods to simplify adding Mapped Diagnostic Context (MDC) to
logging in a consistent way for crawlers and collectors, as well as
offering filename-friendly versions.
Accepts or rejects a reference based on whether one or more
metadata field values match.
Data store engine using MongoDB for storing crawl data.
MVStore configuration parameters.
Common pipeline stage for queuing documents.
Filters URLs based on a regular expression.
Common pipeline stage for filtering references.
Reference-filtering stage utility methods.
Deprecated.
Deprecated.
Since 2.0.0, use ReferenceFilter.
Common pipeline stage for saving documents.
Markers indicating what to do with references that were once processed
properly, but failed to get a good processing state on a subsequent attempt.
Start the Collector.
Stop the Collector.
Alternative to CrawlerConfig.setMaxDocuments(int) for stopping
the crawler upon reaching specific event counts.
Export crawl store to specified file.
Import crawl store from specified file.
MetadataFilter instead.
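
The MD5-based IDocumentChecksummer entry in the list above describes two modes: hash the extracted content, or hash the values of given source fields when any are specified. The following is a minimal sketch of that idea using only the JDK. The class name Md5ChecksumSketch, its methods, and the way field names and values are concatenated are assumptions made for illustration; they are not the library's API and may not match how the library combines fields.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of the Collector API.
public class Md5ChecksumSketch {

    // Returns the lowercase hex MD5 digest of a UTF-8 string.
    public static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Hashes the content unless source fields are given, in which case
    // the checksum is built from those fields only (assumed format:
    // "name=value;" per field, in the order the fields are listed).
    public static String checksum(
            String content,
            Map<String, String> metadata,
            List<String> sourceFields) {
        if (sourceFields == null || sourceFields.isEmpty()) {
            return md5Hex(content);
        }
        StringBuilder sb = new StringBuilder();
        for (String field : sourceFields) {
            sb.append(field)
              .append('=')
              .append(metadata.getOrDefault(field, ""))
              .append(';');
        }
        return md5Hex(sb.toString());
    }
}
```

With source fields set, two documents whose content differs but whose listed field values are identical yield the same checksum, which is the trade-off to keep in mind when choosing fields.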