Norconex Web Crawler

Open-Source Enterprise Crawler (AKA Norconex HTTP Collector)


Crawl web content

Use the Norconex open-source enterprise web crawler to collect website content for your search engine or any other data repository. Run it on its own, or embed it in your own application. It works on any operating system, is fully documented, and comes packaged with sample crawl configurations that run out of the box to get you started quickly.


Features

There are multiple reasons for using Norconex Web Crawler. The following is a partial list of features:

  • Multi-threaded.
  • Supports full and incremental crawls.
  • Supports different hit intervals according to different schedules.
  • Can crawl millions of pages on a single server of average capacity.
  • Extracts text out of many file formats (HTML, PDF, Word, etc.).
  • Extracts metadata associated with documents.
  • Supports pages rendered with JavaScript.
  • Supports deduplication of crawled documents.
  • Language detection.
  • Many content and metadata manipulation options.
  • OCR support on images and PDFs.
  • Page screenshots.
  • Extracts the page "featured" image.
  • Translation support.
  • Dynamic title generation.
  • Configurable crawling speed.
  • URL normalization.
  • Detects modified and deleted documents.
  • Supports different frequencies for re-crawling certain pages.
  • Supports various web site authentication schemes.
  • Supports sitemap.xml (including "lastmod" and "changefreq").
  • Supports robot rules.
  • Supports canonical URLs.
  • Can filter documents based on URL, HTTP headers, content, or metadata.
  • Can treat embedded documents as distinct documents.
  • Can split a document into multiple documents.
  • Can store crawled URLs in different database engines.
  • Can re-process or delete URLs no longer linked by other crawled pages.
  • Supports different URL extraction strategies for different content types.
  • Fires many crawler event types for custom event listeners.
  • Date parsers/formatters to match your source/target repository dates.
  • Can create hierarchical fields.
  • Supports scripting languages for manipulating documents.
  • Reference XML/HTML elements using simple DOM tree navigation.
  • Supports external commands to parse or manipulate documents.
  • Supports crawling with your favorite browser (using WebDriver).
  • Supports several HTTP standards for more efficient crawling, such as If-Modified-Since, ETag, If-None-Match, HSTS, and more.
  • Follows URLs from HTML or any other document format.
  • Can detect and report broken links.
  • Can send crawled content to multiple target repositories at once.
  • Offers monitoring via JMX (e.g., Prometheus).
  • Many others.
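
To illustrate why the HTTP standards listed above (ETag, If-None-Match, If-Modified-Since) make incremental crawls more efficient, here is a minimal Python sketch of a conditional request. It is not Norconex code or configuration; the function names `conditional_headers` and `needs_reprocessing` are hypothetical and exist only for this illustration. The idea is that a crawler remembers the validators a server returned on the previous crawl, sends them back on the next one, and skips re-processing when the server answers 304 Not Modified.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str],
                        last_modified: Optional[str]) -> Dict[str, str]:
    """Build revalidation headers from values saved on a previous crawl.

    Hypothetical helper: `etag` and `last_modified` are the raw ETag and
    Last-Modified header values the server returned last time, if any.
    """
    headers: Dict[str, str] = {}
    if etag:
        # Ask the server to reply 304 if the entity tag still matches.
        headers["If-None-Match"] = etag
    if last_modified:
        # Ask the server to reply 304 if nothing changed since this date.
        headers["If-Modified-Since"] = last_modified
    return headers


def needs_reprocessing(status_code: int) -> bool:
    """Return False when the server reports the cached copy is current."""
    return status_code != 304
```

On a first crawl there are no saved validators, so `conditional_headers(None, None)` returns an empty dictionary and the document is fetched in full. On later crawls, a 304 response carries no body, which saves both bandwidth and processing time.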

Latest news

Norconex Web Crawler 3.0.0 Released!
2022-01-05
The new major release of Norconex HTTP Collector is finally here. Check out what's new. More...

Norconex HTTP Collector 3.0.0 Release Candidate 1
2021-10-10
Is this the last pre-release? Put it to the test and let us know! Includes applicable release candidates of core dependencies as well. More...

Norconex HTTP Collector 3.0.0 Milestone 2
2021-07-28
3.0.0 second milestone release. Includes applicable milestone releases of core dependencies as well. More...

Norconex HTTP Collector 3.0.0 Milestone 1
2021-03-01
A step closer to final release. Available with milestone releases of core dependencies as well. More...

Norconex HTTP Collector 3.0.0 snapshots available
2020-09-07
Development builds of upcoming version 3 now available to experiment with. More...