# Introduction

Norconex Web Crawler is an open-source web crawler that navigates the web to download files and store extracted content into a target repository of your choice (e.g., a search engine, a database, etc.).

While most often used to crawl websites, it can also gather content from other HTTP sources such as RSS feeds, REST APIs, etc.

NOTE

Unless otherwise stated, this documentation focuses on the crawling of website pages and documents.

# The Collector

Figure: A Collector with multiple crawlers[1]

Prior to version 3.0.0, Norconex Web Crawler was referred to as Norconex HTTP Collector, so you may see both "Collector" and "Crawler" used interchangeably. While they generally refer to the same thing, a distinction is worth making to better follow this documentation.

The Collector is the main application (or parent process), which is made up of one or more Crawlers. Think of the Collector as being responsible for "collecting" data, and to do so, it uses crawlers. Another way to look at it is to think of the Collector as a way to group multiple crawlers into a single process and a single configuration file.

When crawling many sites, some people prefer having a single Crawler per Collector, whereas others will have many crawlers under the Collector's responsibility. There are no set rules for this.
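To make the distinction concrete, below is a minimal sketch of a single configuration file grouping two crawlers under one Collector. It follows the general shape of the Web Crawler's XML configuration, but the identifiers and URLs are placeholders, so verify element names against the reference configuration for your version.

```xml
<!-- Sketch only: one Collector (the parent process) grouping two crawlers
     in a single configuration file. IDs and URLs are placeholders. -->
<httpcollector id="my-collector">
  <crawlers>
    <crawler id="site-a">
      <startURLs stayOnDomain="true">
        <url>https://site-a.example.com/</url>
      </startURLs>
    </crawler>
    <crawler id="site-b">
      <startURLs stayOnDomain="true">
        <url>https://site-b.example.com/</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>
```

Whether you run both sites under one Collector like this, or give each site its own Collector, is purely an operational choice.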

# The Crawler

Figure: Main crawl steps: collect, import, and commit[2]

A website crawl is broken down into three major steps:

  1. Collect all documents we are interested in from a website.
  2. Import those documents, creating a normalized text version matching your requirements.
  3. Commit them into a target repository.

These collect, import, and commit phases are analogous to the parts of an ETL process: extract, transform, and load.

# Collect

Collection is the part of the crawling process that extracts data from a given digital source for further processing. For the Norconex Web Crawler, that source is a set of web resources from one or more websites. Those resources are typically HTML pages, but can also be any other downloadable resources such as PDFs, office documents, spreadsheets, XML files, RSS feeds, etc.

The Collector launches one or more crawlers to download web files. Which files get downloaded is determined by a list of starting URLs (also called a seed list), combined with a series of filtering rules and other configurable instructions.

The Collector's concern is to get the documents. Content transformation is left to the Importer module, while the storing of that transformation output into a target repository is the job of a Committer.
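As a rough illustration of the collect step, the sketch below shows a crawler combining a seed list with a depth limit and a filtering rule. The general layout follows the v3 XML configuration; the ExtensionReferenceFilter class is assumed from the collector core, so confirm the exact class name and attributes in the documentation before relying on it.

```xml
<!-- Sketch only: seeds plus filtering rules controlling what gets downloaded. -->
<crawler id="my-crawler">
  <!-- The seed list: where crawling starts. -->
  <startURLs stayOnDomain="true">
    <url>https://example.com/</url>
  </startURLs>
  <!-- How many link "hops" away from the seeds to follow. -->
  <maxDepth>3</maxDepth>
  <!-- Example filtering rule: skip common image files
       (class name assumed from the collector core; verify for your version). -->
  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
            onMatch="exclude">
      png,gif,jpg,jpeg,ico
    </filter>
  </referenceFilters>
</crawler>
```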

# Import

The main goal of the import process is to parse collected files and convert them from their original format to a normalized one, usually text-based, to make them ready to be sent to your target repository. This is achieved by the Norconex Importer module.

It does so by parsing files of any format to produce a plain-text version that holds both content and metadata, while performing a series of configurable transformations to modify or enrich that content and its associated metadata.

Once a document has been imported, it is sent to your Committer(s).
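To give a feel for what an import configuration looks like, here is a hedged sketch of an `<importer>` section with a single post-parse handler that keeps only a few metadata fields. The KeepOnlyTagger class comes from the Importer module, but treat the exact class name and options as assumptions to verify against the Importer documentation.

```xml
<!-- Sketch only: after parsing, keep just the metadata fields you care about. -->
<importer>
  <postParseHandlers>
    <!-- Class name and options assumed; verify for your Importer version. -->
    <handler class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fieldMatcher method="csv">title,keywords,description</fieldMatcher>
    </handler>
  </postParseHandlers>
</importer>
```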

# Commit

For web crawling to be beneficial, you need to store the obtained data somewhere. This is the job of a Norconex Committer.

A Committer understands how to communicate with your target repository to send data to be saved or deleted. Norconex offers a multitude of Committers. While a few file-based ones such as the XML and JSON Committers are provided out-of-the-box, many others can be installed separately, such as the Solr, Elasticsearch, and SQL Committers.
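As an example of the commit step, the sketch below wires one of the out-of-the-box file-based Committers into a crawler, which is handy for inspecting output before targeting a real search engine. The XMLFileCommitter class name is assumed from Committer Core v3; check the reference configuration for the exact class and options in your version.

```xml
<!-- Sketch only: write committed documents to local XML files for inspection. -->
<committers>
  <!-- Class name assumed from Committer Core v3; verify for your version. -->
  <committer class="com.norconex.committer.core3.fs.impl.XMLFileCommitter">
    <directory>./committed-docs</directory>
  </committer>
</committers>
```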


  1. Image contains file icons made by Zlatko Najdenovski from www.flaticon.com ↩︎

  2. Image contains file icons made by Zlatko Najdenovski from www.flaticon.com ↩︎
