Crawl web content
Use the Norconex open-source enterprise web crawler to collect website content for your search engine or any other data repository. Run it on its own, or embed it in your own application (a minimal embedding sketch follows). It works on any operating system, is fully documented, and comes packaged with sample crawl configurations that run out-of-the-box to get you started quickly.
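For example, embedding the crawler in a Java application can be as small as the following sketch. It assumes the version 3.x Java API (class names such as `HttpCollector`, `HttpCollectorConfig`, and `HttpCrawlerConfig`); exact class and method names vary between releases, so treat it as illustrative and consult the Javadoc for your version.

```java
import com.norconex.collector.http.HttpCollector;
import com.norconex.collector.http.HttpCollectorConfig;
import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public class MinimalCrawl {
    public static void main(String[] args) {
        // One crawler: where to start and how deep to go.
        // Setter names are assumed from the 3.x API.
        HttpCrawlerConfig crawlerConfig = new HttpCrawlerConfig();
        crawlerConfig.setId("my-crawler");
        crawlerConfig.setStartURLs("https://example.com/");
        crawlerConfig.setMaxDepth(2);
        // A real setup would also configure one or more Committers
        // here to push crawled documents to your repository.

        // The collector hosts one or more crawlers and runs them.
        HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
        collectorConfig.setId("my-collector");
        collectorConfig.setCrawlerConfigs(crawlerConfig);

        new HttpCollector(collectorConfig).start();
    }
}
```

Without a Committer configured, documents are crawled and processed but not stored anywhere, so defining one is usually the first real configuration step.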
Features
There are multiple reasons for using Norconex Web Crawler. The following is a partial list of features:
- Multi-threaded.
- Supports full and incremental crawls.
- Supports different hit intervals according to different schedules.
- Can crawl millions of documents on a single server of average capacity.
- Extracts text out of many file formats (HTML, PDF, Word, etc.).
- Extracts metadata associated with documents.
- Supports pages rendered with JavaScript.
- Supports deduplication of crawled documents.
- Language detection.
- Many content and metadata manipulation options.
- OCR support on images and PDFs.
- Page screenshots.
- Extracts a page's "featured" image.
- Translation support.
- Dynamic title generation.
- Configurable crawling speed.
- URL normalization.
- Detects modified and deleted documents.
- Supports different frequencies for re-crawling certain pages.
- Supports various web site authentication schemes.
- Supports sitemap.xml (including "lastmod" and "changefreq").
- Supports robot rules.
- Supports canonical URLs.
- Can filter documents based on URL, HTTP headers, content, or metadata.
- Can treat embedded documents as distinct documents.
- Can split a document into multiple documents.
- Can store crawled URLs in different database engines.
- Can re-process or delete URLs no longer linked by other crawled pages.
- Supports different URL extraction strategies for different content types.
- Fires many crawler event types for custom event listeners (see the listener sketch after this list).
- Date parsers/formatters to match your source/target repository dates.
- Can create hierarchical fields.
- Supports scripting languages for manipulating documents.
- Reference XML/HTML elements using simple DOM tree navigation.
- Supports external commands to parse or manipulate documents.
- Supports crawling with your favorite browser (using WebDriver).
- Supports several HTTP standards for more efficient crawling, such as If-Modified-Since, ETag, If-None-Match, HSTS, and more (see the conditional-request sketch after this list).
- Follows URLs from HTML or any other document format.
- Can detect and report broken links.
- Can send crawled content to multiple target repositories at once.
- Offers monitoring via JMX (e.g., Prometheus).
- Many others.
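As an illustration of the event listener support mentioned above, here is a sketch that logs every event a crawler fires. It extends the embedding sketch from the introduction; the `IEventListener` interface and the `addEventListeners(...)` method are assumptions based on the 3.x API and may differ in your release.

```java
import com.norconex.commons.lang.event.Event;
import com.norconex.commons.lang.event.IEventListener;

// Assumed 3.x API: a listener invoked for every crawler event.
IEventListener<Event> auditLog = event ->
        System.out.println(event.getName() + " -> " + event.getSource());

// Register it on the collector configuration from the earlier sketch.
collectorConfig.addEventListeners(auditLog);
```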
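To show what the conditional-request headers listed above buy you, independently of how Norconex itself implements them, here is a self-contained sketch using the JDK's built-in HttpClient: when the server replies 304 Not Modified, a crawler can skip re-downloading and re-processing the page. The URL and cached header values are hypothetical placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConditionalFetch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Values remembered from a previous crawl of the same URL
        // (hypothetical placeholders).
        String cachedEtag = "\"abc123\"";
        String cachedLastModified = "Wed, 01 Jan 2025 00:00:00 GMT";

        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://example.com/page.html"))
                .header("If-None-Match", cachedEtag)
                .header("If-Modified-Since", cachedLastModified)
                .GET()
                .build();

        HttpResponse<byte[]> response =
                client.send(request, HttpResponse.BodyHandlers.ofByteArray());

        if (response.statusCode() == 304) {
            // 304 Not Modified: skip re-downloading and re-processing.
            System.out.println("Unchanged since last crawl.");
        } else {
            System.out.println("Fetched " + response.body().length + " bytes.");
        }
    }
}
```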