Crawl web content
Collect website content for your
search engine or any other data repository. Run this full-featured
crawler on its own, or embed it in your own application.
It works on any operating system,
is fully documented, and is packaged with sample crawl configurations
that run out of the box to get you started quickly.
Norconex HTTP Collector shares common features with other Norconex
Collectors. Find out about those
The following is a non-exhaustive list of features supported by
the Norconex HTTP Collector:
- Supports full and incremental crawls.
- Supports different hit intervals according to different schedules.
- Can crawl millions of documents on a single server of average capacity.
- Extracts text out of many file formats (HTML, PDF, Word, etc.).
- Extract metadata associated with documents.
- Language detection.
- Many content and metadata manipulation options.
- OCR support on images and PDFs.
- Page screenshots.
- Extracts a page's "featured" image.
- Translation support.
- Dynamic title generation.
- Configurable crawling speed.
- URL normalization.
- Detects modified and deleted documents.
- Supports different frequencies for re-crawling certain pages.
- Supports various web site authentication schemes.
- Supports sitemap.xml (including "lastmod" and "changefreq").
- Supports robot rules.
- Supports canonical URLs.
- Can filter documents based on URL, HTTP headers, content, or metadata.
- Can treat embedded documents as distinct documents.
- Can split a document into multiple documents.
- Can store crawled URLs in different database engines.
- Can re-process or delete URLs no longer linked by other crawled pages.
- Supports different URL extraction strategies for different content types.
- Fires more than 20 crawler event types for custom event listeners.
- Date parsers/formatters to match your source/target repository dates.
- Can create hierarchical fields.
- Supports scripting languages for manipulating documents.
- Reference XML/HTML elements using simple DOM tree navigation.
- Supports external commands to parse or manipulate documents.
- Supports crawling with your favorite browser (using WebDriver).
- Supports "If-Modified-Since" for more efficient crawling.
- Follows URLs from HTML or any other document format.
- Can detect and report broken links.
- Can send crawled content to multiple target repositories at once.
- Many others.
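
To make one of the listed features concrete, here is a minimal Java sketch of what URL normalization can involve. It is an illustration only and not the Norconex implementation or API: the class name `UrlNormalizerSketch` and its `normalize` method are hypothetical, the rules shown (lowercasing scheme and host, dropping default ports, resolving dot segments, removing fragments) are just a common subset, and the input is assumed to be an absolute http or https URL.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Norconex HTTP Collector.
public class UrlNormalizerSketch {

    public static String normalize(String url) {
        // URI.normalize() resolves "." and ".." path segments.
        URI uri = URI.create(url).normalize();

        // Scheme and host are case-insensitive, so lowercase them.
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // Drop the port when it is the default for the scheme.
        int port = uri.getPort();
        if ((scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443)) {
            port = -1;
        }

        // An empty path is equivalent to "/".
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        try {
            // Passing null as the fragment removes any "#..." anchor.
            return new URI(scheme, null, host, port, path, uri.getQuery(), null)
                    .toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot normalize: " + url, e);
        }
    }

    public static void main(String[] args) {
        // Prints: http://example.com/a/c.html?x=1
        System.out.println(normalize("HTTP://Example.COM:80/a/./b/../c.html?x=1#top"));
    }
}
```

Normalizing URLs this way lets a crawler recognize that several spellings point to the same page, so the page is fetched and stored only once. The actual normalization rules are a matter of crawler configuration and may differ from this sketch.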