Norconex HTTP Collector is an open-source web crawler that navigates the web to download files and store extracted content in a target repository of your choice (e.g., a search engine or a database).
While most often used to crawl websites, it can also gather content from other HTTP sources such as RSS feeds, REST APIs, etc.
Unless otherwise stated, this documentation focuses on the crawling of website pages and documents.
# How Does it Work?
A website crawl is broken down into three major steps:
- Collect all documents we are interested in from a website.
- Import those documents, creating a normalized text version matching your requirements.
- Commit them into a target repository.
These collect, import, and commit phases are analogous to the parts of an ETL process: extract, transform, and load.
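The three phases can be pictured as a small pipeline. The following is only a minimal sketch of the collect, import, and commit idea, not the Collector's actual API: the function names, the dictionary-based document, and the in-memory `source` and `repository` are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List

# Hypothetical document shape: a reference (URL) plus its content.
Document = Dict[str, str]


def collect(source: Dict[str, str]) -> Iterable[Document]:
    """Extract: yield one raw document per entry of a stand-in source."""
    for url, raw in source.items():
        yield {"reference": url, "content": raw}


def import_document(doc: Document) -> Document:
    """Transform: normalize the raw content (here, just collapse whitespace)."""
    return {**doc, "content": " ".join(doc["content"].split())}


def commit(doc: Document, repository: List[Document]) -> None:
    """Load: store the normalized document in the target repository."""
    repository.append(doc)


def crawl(source: Dict[str, str], repository: List[Document]) -> List[Document]:
    """Run every collected document through import, then commit."""
    for raw_doc in collect(source):
        commit(import_document(raw_doc), repository)
    return repository
```

Each function stands in for one phase, so any of them could be swapped out without touching the other two, which is the same separation of concerns the ETL analogy describes.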
Collection is the part of the crawling process that extracts data from a given digital source for further processing. For the Norconex HTTP Collector, that source is an ensemble of web resources from one or more websites. Those resources are typically HTML pages, but can also be any other downloadable resources such as PDFs, office documents, spreadsheets, XML files, RSS feeds, etc.
The Collector launches one or more crawlers to download web files. Which files get downloaded is determined by a list of starting URLs (a seed list), combined with a series of filtering rules and other configurable instructions.
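To make the interplay between a seed list and filtering rules concrete, here is a rough sketch of a breadth-first crawl over an in-memory link graph. It does not use the Collector's configuration or classes: `FAKE_WEB`, `crawl_urls`, and the two filter functions are hypothetical names, and no network access takes place.

```python
import re
from collections import deque
from typing import Callable, Dict, List

# Hypothetical stand-in for the web: each URL maps to the links it contains.
FAKE_WEB: Dict[str, List[str]] = {
    "https://example.com/": [
        "https://example.com/a.html",
        "https://example.com/login",
        "https://other.example.org/",
    ],
    "https://example.com/a.html": [
        "https://example.com/b.pdf",
        "https://example.com/",
    ],
    "https://example.com/b.pdf": [],
}


def stay_on_site(url: str) -> bool:
    """Filter rule: only accept URLs under the example.com site."""
    return url.startswith("https://example.com/")


def skip_login_pages(url: str) -> bool:
    """Filter rule: reject any URL whose path mentions 'login'."""
    return re.search(r"/login", url) is None


def crawl_urls(
    start_urls: List[str],
    fetch_links: Callable[[str], List[str]],
    filters: List[Callable[[str], bool]],
    max_documents: int = 100,
) -> List[str]:
    """Visit URLs breadth-first from the seed list, queuing only accepted links."""
    queue = deque(start_urls)
    seen = set(start_urls)
    collected: List[str] = []
    while queue and len(collected) < max_documents:
        url = queue.popleft()
        collected.append(url)
        for link in fetch_links(url):
            if link in seen:
                continue  # never process the same URL twice
            seen.add(link)
            if all(accept(link) for accept in filters):
                queue.append(link)
    return collected


collected_urls = crawl_urls(
    ["https://example.com/"],
    lambda url: FAKE_WEB.get(url, []),
    [stay_on_site, skip_login_pages],
)
```

With these assumed rules, the login page and the external site are discovered but never queued, so only the three `example.com` documents end up in `collected_urls`.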
The Collector's concern is to get the documents. Content transformation is left to the Importer module, while the storing of that transformation output into a target repository is the job of a Committer.
The main goal of the import process is to parse collected files and convert them from their original format to a normalized one, usually text-based, to make them ready to be sent to your target repository. This is achieved by the Norconex Importer module.
It does so by parsing files of any format to convert them to a plain-text version that holds both content and metadata, as well as by performing a series of configurable transformations to modify or enrich that content and its associated metadata.
Once a document has been imported, it is sent to your committer(s).
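As an illustration of parsing followed by transformation, the sketch below turns an HTML string into plain text plus metadata using only Python's standard `html.parser` module, then applies a list of transformations. It is not the Norconex Importer: `import_html`, `add_word_count`, and the dictionary layout are assumptions, and a real importer handles many more formats than HTML.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List


class _TextExtractor(HTMLParser):
    """Collect visible text, the <title>, and named <meta> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: List[str] = []
        self.metadata: Dict[str, str] = {}
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            if attributes.get("name") and attributes.get("content"):
                self.metadata[attributes["name"].lower()] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            # Append in case the parser delivers the title in several chunks.
            self.metadata["title"] = (self.metadata.get("title", "") + data).strip()
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def add_word_count(doc: dict) -> dict:
    """Example transformation: enrich the metadata with a word count."""
    doc["metadata"]["word_count"] = str(len(doc["content"].split()))
    return doc


def import_html(html: str, transformations: List[Callable[[dict], dict]]) -> dict:
    """Parse HTML into text and metadata, then apply each transformation in order."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    doc = {"content": " ".join(parser.text_parts), "metadata": parser.metadata}
    for transform in transformations:
        doc = transform(doc)
    return doc
```

Keeping transformations as plain functions applied in sequence mirrors the idea of a configurable chain: adding, removing, or reordering steps does not require changing the parser.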
For web crawling to be beneficial, you need to store the obtained data somewhere. This is the job of a Norconex Committer.
A Committer understands how to communicate with your target repository to send data to be saved or deleted. Norconex offers a multitude of Committers. While a few file-based ones such as the XML and JSON Committers are provided out-of-the-box, there are many others you can install separately, such as Solr, Elasticsearch, and SQL.
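The save-or-delete contract described above can be sketched as a small interface with one file-style implementation. This is an illustrative assumption rather than the actual Norconex Committer API: the `Committer` base class, its `upsert` and `delete` methods, and `JsonLinesCommitter` are hypothetical names, and an in-memory buffer stands in for a real file or repository connection.

```python
import io
import json
from abc import ABC, abstractmethod
from typing import Dict, TextIO


class Committer(ABC):
    """Hypothetical contract: tell a target repository what to save or delete."""

    @abstractmethod
    def upsert(self, reference: str, content: str, metadata: Dict[str, str]) -> None:
        """Add the document, or update it if the reference already exists."""

    @abstractmethod
    def delete(self, reference: str) -> None:
        """Remove the document identified by the reference."""


class JsonLinesCommitter(Committer):
    """Write each operation as one JSON object per line to a text stream."""

    def __init__(self, stream: TextIO) -> None:
        self.stream = stream

    def _write(self, operation: dict) -> None:
        self.stream.write(json.dumps(operation) + "\n")

    def upsert(self, reference: str, content: str, metadata: Dict[str, str]) -> None:
        self._write(
            {"op": "upsert", "reference": reference,
             "content": content, "metadata": metadata}
        )

    def delete(self, reference: str) -> None:
        self._write({"op": "delete", "reference": reference})


# Usage: an in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
committer = JsonLinesCommitter(buffer)
committer.upsert("https://example.com/a.html", "Hello world", {"title": "Hi"})
committer.delete("https://example.com/old.html")
```

A committer for a search engine or database would implement the same two methods but translate them into that repository's own requests, which is why the crawler itself does not need to know where the data ends up.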