To get started quickly, download the latest version of Norconex Web Crawler and locate the file ./examples/HOWTO_RUN_EXAMPLES.txt. This file points you to functional configuration files and will have you running a very simple crawler in no time. See how to run these examples here; they are replicated for your convenience.
To get the full potential of Norconex Web Crawler and learn which parts can easily be extended, refer to the following XML-based configuration. Entries with a "class" attribute expect an implementation of your choice. The Web Crawler API already offers several concrete implementations, and configuration options with a default value do not have to be defined. Developers can also create their own implementations of the proper Java interfaces. Refer to the Norconex Web Crawler JavaDoc and/or see further down which interfaces you can implement. Go to the Extend the Web Crawler page for more details on adding your own implementations.
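For instance, a tag that takes a "class" attribute is configured with the fully qualified name of the chosen implementation. A minimal sketch, assuming the stock 2.x package layout:

```xml
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer"/>
```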
Click on a tag name to jump to/from its documentation.
```xml
<httpcollector id="...">

  <progressDir>...</progressDir>
  <logsDir>...</logsDir>

  <collectorListeners>
    <listener class="..."/>
    ...
  </collectorListeners>

  <crawlerDefaults>
    <!-- All crawler options defined below (except for the crawler "id")
         can be set here as defaults shared between multiple crawlers.
         Configuration blocks defined for a specific crawler always take
         precedence. -->
  </crawlerDefaults>

  <crawlers>
    <!-- You need to define at least one crawler. -->
    <crawler id="...">

      <startURLs stayOnDomain="..." includeSubdomains="..."
                 stayOnPort="..." stayOnProtocol="...">
        <url>...</url>
        <urlsFile>...</urlsFile>
        <sitemap>...</sitemap>
        <provider class="..."/>
        ...
      </startURLs>

      <userAgent>...</userAgent>
      <urlNormalizer class="..." />
      <delay class="..." />
      <numThreads>...</numThreads>
      <maxDepth>...</maxDepth>
      <maxDocuments>...</maxDocuments>
      <workDir>...</workDir>
      <keepDownloads>...</keepDownloads>
      <keepOutOfScopeLinks>...</keepOutOfScopeLinks>
      <orphansStrategy>...</orphansStrategy>

      <stopOnExceptions>
        <exception>...</exception>
        ...
      </stopOnExceptions>

      <crawlerListeners>
        <listener class="..."/>
        ...
      </crawlerListeners>

      <crawlDataStoreFactory class="..." />
      <httpClientFactory class="..." />

      <referenceFilters>
        <filter class="..." />
        ...
      </referenceFilters>

      <robotsTxt ignore="..." class="..."/>
      <sitemapResolverFactory ignore="..." class="...">
        ...
      </sitemapResolverFactory>
      <redirectURLProvider class="..." />
      <recrawlableResolver class="..." />
      <metadataFetcher class="..." />

      <metadataFilters>
        <filter class="..." />
        ...
      </metadataFilters>

      <canonicalLinkDetector ignore="..." class="..." />
      <metadataChecksummer class="..." />
      <documentFetcher class="..." />
      <robotsMeta ignore="..." class="..."/>

      <linkExtractors>
        <extractor class="..." />
      </linkExtractors>

      <documentFilters>
        <filter class="..." />
        ...
      </documentFilters>

      <preImportProcessors>
        <processor class="..." />
        ...
      </preImportProcessors>

      <importer>
        <!-- refer to Importer documentation -->
      </importer>

      <documentChecksummer class="..." />

      <postImportProcessors>
        <processor class="..."></processor>
      </postImportProcessors>

      <spoiledReferenceStrategizer class="..." />

      <committer class="...">
        <!-- refer to Committer documentation -->
      </committer>

    </crawler>
    ...
  </crawlers>

</httpcollector>
```
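To make the reference above more concrete, here is a minimal configuration sketch using only a handful of options. The identifiers, start URL, and output directory are hypothetical placeholders, and every omitted option falls back to its default:

```xml
<httpcollector id="Minimal Collector">
  <crawlers>
    <crawler id="Minimal Crawler">
      <!-- Hypothetical site; stayOnDomain keeps the crawl in scope. -->
      <startURLs stayOnDomain="true">
        <url>https://example.com/</url>
      </startURLs>
      <maxDepth>2</maxDepth>
      <!-- Store processed documents on the local file system. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./crawled-output</directory>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
```

Saved as, say, minimal-config.xml, such a file can be passed to the launch scripts described in HOWTO_RUN_EXAMPLES.txt.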
The table below lists the interface names that you can easily extend, along with the available out-of-the-box implementations. In the configuration file, you have to use the fully qualified name, as defined in the Javadoc. Click on a class or interface name to go directly to its full documentation, with extra configuration options. When a default implementation exists for a configuration option taking a class attribute, it is highlighted.
Tag | Description | Classes | Interface |
---|---|---|---|
httpcollector | Root tag; you must give your configuration a unique identifier value. | N/A | N/A |
progressDir | Directory where crawling progress files are stored. Default is "./progress". | N/A | N/A |
logsDir | Directory where crawl logs are stored. Default is "./logs". | N/A | N/A |
collectorListeners | Listen to collector events. | | ICollectorLifeCycleListener |
startURLs | URLs to start crawling from. Can be one or several of <url>, <urlsFile> (file containing URLs), <sitemap> (URL to a sitemap.xml file) or <provider> (for a dynamically generated list). | N/A | IStartURLsProvider (<provider> only) |
userAgent | The crawler "User-Agent" value to identify your crawler to sites you crawl. | N/A | N/A |
urlNormalizer | Normalizes incoming URLs. | GenericURLNormalizer | IURLNormalizer |
delay | Handles interval between each page download. | GenericDelayResolver, ReferenceDelayResolver | IDelayResolver |
numThreads | Number of execution threads for a crawler. Default is 2. | N/A | N/A |
maxDepth | How many levels deep to crawl from the start URL(s). Default is -1 (unlimited). | N/A | N/A |
maxDocuments | Maximum documents to successfully process. Default is -1 (unlimited). | N/A | N/A |
workDir | Where to store files created as part of crawling activities. Default is "./work". | N/A | N/A |
keepDownloads | Whether to keep downloaded files. Default is false. | N/A | N/A |
keepOutOfScopeLinks | Keep extracted links that are out-of-scope according to the start URL stayOn... flags. Default is false. | N/A | N/A |
orphansStrategy | What to do with URLs that are no longer referenced. PROCESS (default), IGNORE, or DELETE. | N/A | N/A |
stopOnExceptions | What exception(s) should force a crawler to stop when triggered during the processing of a document. | N/A | N/A |
crawlerListeners | Listen to crawling events. | URLStatusCrawlerEventListener | ICrawlerEventListener |
crawlDataStoreFactory | URLs and crawl-related information data store. | MVStoreCrawlDataStoreFactory, JDBCCrawlDataStoreFactory, MongoCrawlDataStoreFactory | ICrawlDataStoreFactory |
httpClientFactory | HTTP Client creation and initialization. | GenericHttpClientFactory | IHttpClientFactory |
referenceFilters | Filter based on references (i.e. URLs); see the example after this table. | ExtensionReferenceFilter, RegexReferenceFilter, SegmentCountURLFilter | IReferenceFilter |
robotsTxt | Handle robots.txt files. | StandardRobotsTxtProvider | IRobotsTxtProvider |
sitemapResolverFactory | Handle Sitemap files. | StandardSitemapResolverFactory | ISitemapResolverFactory |
redirectURLProvider | Provides the target URL to use when a redirect is encountered. | GenericRedirectURLProvider | IRedirectURLProvider |
recrawlableResolver | Indicates if a target URL is ready for recrawl or not. | GenericRecrawlableResolver | IRecrawlableResolver |
metadataFetcher | Fetches HTTP headers for a URL. | GenericMetadataFetcher | IHttpMetadataFetcher |
metadataFilters | Filter based on HTTP Headers. | ExtensionReferenceFilter, RegexReferenceFilter, RegexMetadataFilter, SegmentCountURLFilter | IMetadataFilter |
canonicalLinkDetector | Detects pages with a canonical link and rejects them in favor of the canonical one. | GenericCanonicalLinkDetector | ICanonicalLinkDetector |
metadataChecksummer | Create document checksum from HTTP Headers. | LastModifiedMetadataChecksummer, GenericMetadataChecksummer | IMetadataChecksummer |
documentFetcher | Fetch a document from a URL. | GenericDocumentFetcher, PhantomJSDocumentFetcher | IHttpDocumentFetcher |
robotsMeta | Handle in-page robots instructions. | StandardRobotsMetaProvider | IRobotsMetaProvider |
linkExtractors | One or more <extractor> for extracting URLs and other link data from a document. | GenericLinkExtractor, TikaLinkExtractor, RegexLinkExtractor, XMLFeedLinkExtractor | ILinkExtractor |
documentFilters | Filter documents (after links were extracted). | ExtensionReferenceFilter, RegexReferenceFilter, RegexMetadataFilter, SegmentCountURLFilter | IDocumentFilter |
preImportProcessors | Process a document before import. | FeaturedImageProcessor | IHttpDocumentProcessor |
importer | Performs document text extraction and manipulation. It has many features and supports many file formats. Refer to the Importer configuration options. | N/A | N/A |
documentChecksummer | Create a checksum from document. | MD5DocumentChecksummer | IDocumentChecksummer |
postImportProcessors | Process a document after import. | | IHttpDocumentProcessor |
spoiledReferenceStrategizer | Establish the strategy to adopt for references that have turned bad. | GenericSpoiledReferenceStrategizer | ISpoiledReferenceStrategizer |
committer | Where to commit a document when processed. Different implementations are available; check out the list of available Committers. | | ICommitter |
crawler | Define as many crawlers as you like. They must each have a unique identifier. | N/A | N/A |
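As an example of how several of these options compose, the sketch below narrows a crawl using reference filters. The URLs and regular expression are illustrative only, and the class names assume the 2.x collector-core packages:

```xml
<referenceFilters>
  <!-- Keep only URLs under a (hypothetical) /docs/ section. -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="include">https://example\.com/docs/.*</filter>
  <!-- Reject common binary formats by file extension. -->
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
          onMatch="exclude">jpg,png,gif,pdf</filter>
</referenceFilters>
```

When one or more filters are declared with onMatch="include", a reference must match at least one of them to be kept; any matching "exclude" filter rejects it.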