Universal Crawlers
Crawled documents can be submitted to any search engine or other data
repository using the appropriate Committer.
Create your own Committer or download one readily available from a
growing list (Solr, Elasticsearch, Amazon CloudSearch, Azure Search, ...).
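As an illustration, a Committer is declared directly in the crawler's XML
configuration. This sketch uses the 2.x Solr Committer; the class name and
the solrURL option reflect its documentation at that time and may differ
in your version:

    <committer class="com.norconex.committer.solr.SolrCommitter">
      <solrURL>http://localhost:8983/solr/mycore</solrURL>
    </committer>

Switching repositories is then a matter of swapping this one element for
another Committer class and its own options.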
Easy for developers to extend
Built with developers in mind.
Almost every feature can be replaced or enhanced
by your own implementation. New features can also be added without
the need to learn yet another plugin framework.
Designed to be extended by integrators.
Commercially supported
Norconex offers commercial support for its Crawlers. Community support
is also available via GitHub.
Metadata extraction
All metadata associated with documents is extracted by default
(document properties, HTTP headers, file properties, ...).
You can easily add, modify, remove, or rename any of it.
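As a hedged example, metadata fields can be manipulated through Norconex
Importer handlers declared in the crawler configuration. The tagger classes
below exist in Importer 2.x; attribute names may vary between versions:

    <importer>
      <postParseHandlers>
        <!-- Rename a field -->
        <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
          <rename fromField="dc.title" toField="title"/>
        </tagger>
        <!-- Add a constant field to every document -->
        <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
          <constant name="source">intranet</constant>
        </tagger>
      </postParseHandlers>
    </importer>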
Easy to run
Fully documented and works out-of-the-box with sample XML configurations
you can modify to suit your needs.
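As a rough sketch, loosely based on the samples shipped with the 2.x
distribution (element names may differ slightly in your version), a minimal
HTTP crawler configuration looks like this:

    <httpcollector id="My Collector">
      <crawlers>
        <crawler id="My Crawler">
          <startURLs>
            <url>https://example.com/</url>
          </startURLs>
          <maxDepth>2</maxDepth>
        </crawler>
      </crawlers>
    </httpcollector>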
Embeddable
Can be integrated with and run from your Java applications,
with or without file-based configuration.
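Below is a rough Java sketch of embedding the HTTP Collector, written
against the 2.x API from memory; class and method names changed between
major versions, so treat it as an illustration rather than a reference:

    import com.norconex.collector.http.HttpCollector;
    import com.norconex.collector.http.HttpCollectorConfig;
    import com.norconex.collector.http.crawler.HttpCrawlerConfig;

    public class EmbeddedCrawler {
        public static void main(String[] args) {
            // Configure a single crawler programmatically.
            HttpCrawlerConfig crawlerConfig = new HttpCrawlerConfig();
            crawlerConfig.setId("my-crawler");
            crawlerConfig.setStartURLs("https://example.com/");
            crawlerConfig.setMaxDepth(2);

            // Wrap it in a collector and start it
            // (false = fresh start rather than a resume).
            HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
            collectorConfig.setId("my-collector");
            collectorConfig.setCrawlerConfigs(crawlerConfig);
            new HttpCollector(collectorConfig).start(false);
        }
    }

The same configuration could instead be loaded from the XML files described
above, which is what the launch scripts do.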
Ease of maintenance
Break your configuration into fragments however you like,
and share entire configuration sections between many crawler instances.
Only one central copy of the crawler binaries needs to be installed
and updated. No more duplicating binaries or identical configuration
values across several installations.
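Norconex collectors resolve their XML configuration through Apache
Velocity, so a shared fragment can typically be pulled in where needed.
A sketch, with hypothetical file names:

    <!-- In the main configuration of each crawler instance -->
    <crawler id="crawler-a">
      #parse("shared/committer.xml")
    </crawler>

Updating shared/committer.xml then updates every crawler that includes it.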
Resumable upon system failure
Upon a major crash (e.g. a server shutdown), you can resume a Crawler and
it will pick up exactly where it left off. You can also stop it and resume
it yourself.
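With the 2.x launch scripts, starting, stopping, and resuming are done from
the command line along these lines (the flags and the my-config.xml file
name are illustrative; run the script without arguments for exact usage):

    collector-http.sh -a start  -c my-config.xml
    collector-http.sh -a stop   -c my-config.xml
    collector-http.sh -a resume -c my-config.xml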
Broken down into reusable modules
Crawlers use modules such as Norconex Importer and Norconex Committer
which can both be used on their own for different purposes
(e.g. to build your own crawler).
Cross-platform
100% Java-based. Runs on Windows, Linux, Unix, Mac, and any other
operating system supporting Java.
Your installation can be moved from one operating system
to another without issues.
Portable
Ensures clean and easy deployments to different locations
(staging, production, etc.) by letting you store environment-specific
configuration settings (IP addresses, paths, URLs, ...)
in isolated variable files.
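As a sketch, environment-specific values go in a variables file (the file
name and keys below are hypothetical):

    # prod.variables -- values for the production environment
    solrURL = http://prod-solr:8983/solr/mycore
    workDir = /opt/crawler/workdir

They are then referenced from the XML configuration, which stays identical
across environments:

    <committer class="com.norconex.committer.solr.SolrCommitter">
      <solrURL>${solrURL}</solrURL>
    </committer>

How a variables file is picked up (same base name as the configuration
file, or passed at launch time) is covered in the configuration
documentation.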
Open-Source
Licensed under business-friendly Apache License 2.0.
The high-quality source code is available for download on GitHub.
Powerful
Tested on many sites totaling millions of documents. Crawl what you want,
where you want. Allows for any kind of document manipulation, before or
after their text is extracted.
Easy for non-technical people to use
While developers will see more benefits, the out-of-the-box features are
well-documented and easy to configure for non-developers.
Good documentation
Besides examples provided on crawler pages or in distributed packages,
up-to-date Javadocs are automatically generated with every release.
They constitute most of the documentation, with each class documenting
its XML configuration options where applicable.
Event listeners
Write listeners for most crawling events, allowing you to build your own
add-on solutions, such as custom crawling reports.
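A rough Java sketch of such a listener, written from memory against the
2.x ICrawlerEventListener interface (the listener mechanism was reworked in
later versions, so adjust to the interface your version exposes):

    import java.util.HashMap;
    import java.util.Map;

    import com.norconex.collector.core.crawler.ICrawler;
    import com.norconex.collector.core.crawler.event.CrawlerEvent;
    import com.norconex.collector.core.crawler.event.ICrawlerEventListener;

    // Counts crawler events by type, e.g. to feed a custom crawling report.
    public class EventCountListener implements ICrawlerEventListener {

        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
            counts.merge(event.getEventType(), 1, Integer::sum);
        }
    }

The listener is then registered in the crawler configuration (in 2.x, under
a crawlerListeners element, if memory serves).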
Logs are meaningful and verbose
Can log every document failure with the exact cause of failure.
Uses Log4j to control log levels.
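For example, with the Log4j 1.x properties file shipped with older
distributions, verbosity can be raised or lowered per package using
standard Log4j syntax (the exact file in your version may differ):

    # In log4j.properties: show per-document details from Norconex classes
    log4j.logger.com.norconex=DEBUG
    # ...or quiet them down
    # log4j.logger.com.norconex=WARN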
Flexible
There is more than one way to do things. Experiment and have fun!