Crawled documents can be submitted to any search engine or other data
repositories using the appropriate Committer.
Create your own Committer or download one readily available from a
growing list (Solr, Elasticsearch, Amazon CloudSearch, Azure Search, ...).
Built with developers in mind.
Almost every feature can be replaced or enhanced
by your own implementation. New features can also be added without
the need to learn yet another plugin framework.
Expected to be extended often by integrators.
Easy for developers to extend
Norconex offers commercial support for its Collectors. Community support
is also available via GitHub.
All metadata associated with documents are extracted by default
(document properties, HTTP headers, file properties, ...).
You can easily add, modify, remove, rename, or otherwise manipulate
metadata fields.
Obtain and modify document metadata
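
To make the add/modify/remove/rename operations concrete, here is a minimal
sketch in plain Java. It models metadata as a map of field names to value
lists. The class and method names (MetadataSketch, add, set, rename) are
hypothetical and are not the Collectors' actual API; they only illustrate
the kind of manipulation described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MetadataSketch {

    // Hypothetical stand-in for a document's metadata: field name -> values.
    static Map<String, List<String>> newMetadata() {
        return new TreeMap<>();
    }

    // Add a value to a field, creating the field if it does not exist yet.
    static void add(Map<String, List<String>> meta, String field, String value) {
        meta.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    // Modify a field by replacing all of its values with a single one.
    static void set(Map<String, List<String>> meta, String field, String value) {
        List<String> values = new ArrayList<>();
        values.add(value);
        meta.put(field, values);
    }

    // Rename a field, merging into the target field if it already exists.
    static void rename(Map<String, List<String>> meta, String from, String to) {
        List<String> values = meta.remove(from);
        if (values != null) {
            meta.computeIfAbsent(to, k -> new ArrayList<>()).addAll(values);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> meta = newMetadata();
        add(meta, "Content-Type", "text/html");  // e.g. from an HTTP header
        add(meta, "dc.title", "Sample page");    // e.g. a document property
        add(meta, "X-Debug", "trace-123");       // a field we do not want to keep

        rename(meta, "dc.title", "title");
        set(meta, "Content-Type", "text/html; charset=UTF-8");
        meta.remove("X-Debug");

        System.out.println(meta);
    }
}
```

In a real installation these operations would normally be expressed in the
XML configuration rather than in code.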
Fully documented and works out-of-the-box with sample XML configurations
you can modify to suit your needs.
Easy to run
Can be integrated into and run from your Java applications
(with or without file-based configuration).
Break your configuration into fragments any way you like,
and share entire configuration sections between many crawler instances.
Only one central copy of the crawler binaries is required and/or needs to
be updated. No more duplication of binaries or identical configuration
values across several installations.
Ease of maintenance
Upon a major crash (e.g. server shutdown), you can resume a Collector and
it will pick up exactly where it left off. You can also stop it and resume
it later.
Resumable upon system failure
Collectors use modules such as Norconex Importer and Norconex Committer
which can both be used on their own for different purposes
(e.g. to build your own collectors).
Broken down into reusable modules
100% Java-based. Runs on Windows, Linux, Unix, Mac, and any other
operating system supporting Java.
Your installation can be moved from one operating system
to another without issues.
Ensures clean and easy deployments to different locations
(staging, prod, etc.) by letting you store environment-specific
configuration settings (IP addresses, paths, URLs, ...)
in isolated variable files.
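
The idea of isolated variable files can be sketched in plain Java as
follows. This is an illustration of the concept only, not the Collectors'
own configuration loader: the ${name} placeholder syntax, the XML tag
names, and the VariableFileSketch class are assumptions made for the
example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VariableFileSketch {

    // Matches placeholders such as ${startUrl} (assumed syntax for this sketch).
    private static final Pattern VAR = Pattern.compile("\\$\\{([A-Za-z0-9_.]+)\\}");

    // Replace each ${name} in the template with its value from vars.
    // Unknown variables are left untouched so they are easy to spot.
    static String resolve(String template, Properties vars) {
        Matcher m = VAR.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.getProperty(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Parse key=value lines, as they would appear in a variable file.
    static Properties load(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        // Shared configuration, identical in every environment (hypothetical tags).
        String config = "<crawler>\n"
                + "  <startURL>${startUrl}</startURL>\n"
                + "  <workDir>${workDir}</workDir>\n"
                + "</crawler>";

        // One small variable file per environment, inlined here for brevity.
        Properties staging = load("startUrl=http://staging.example.com/\n"
                + "workDir=/tmp/crawl-staging\n");
        Properties prod = load("startUrl=http://www.example.com/\n"
                + "workDir=/var/crawl\n");

        System.out.println(resolve(config, staging));
        System.out.println(resolve(config, prod));
    }
}
```

The shared configuration never changes between environments; only the
small variable file does, which is what keeps deployments clean.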
Licensed under business-friendly Apache License 2.0.
The high-quality source code is available for download on GitHub.
Tested on many sites with millions of documents. Crawl what you want,
where you want. Allows any kind of document manipulation, before or
after text is extracted.
While developers will see more benefits, the out-of-the-box features are
well-documented and easy to configure for non-developers.
Easy for non-technical people to use
Besides examples provided on collector pages or in distributed packages,
up-to-date Javadocs are automatically generated with every release.
They constitute most of the documentation, each of them containing XML
configuration options where applicable.
Write listeners for most crawling events, allowing you to build your own
add-on solutions, such as custom crawling reports.
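
As a rough illustration of the listener idea, the sketch below builds a tiny
crawling report that counts events by type. All types here (CrawlEvent,
CrawlListener, EventBus, CountingReport) and the event names are
hypothetical stand-ins written for this example; the real listener
interfaces are described in the Javadocs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CrawlReportSketch {

    // Hypothetical event: a type such as "DOCUMENT_FETCHED" plus the URL involved.
    static final class CrawlEvent {
        final String type;
        final String url;
        CrawlEvent(String type, String url) {
            this.type = type;
            this.url = url;
        }
    }

    // Hypothetical listener contract: called once per crawling event.
    interface CrawlListener {
        void onEvent(CrawlEvent event);
    }

    // A custom report implemented as a listener that counts events by type.
    static final class CountingReport implements CrawlListener {
        final Map<String, Integer> counts = new TreeMap<>();

        @Override
        public void onEvent(CrawlEvent event) {
            counts.merge(event.type, 1, Integer::sum);
        }
    }

    // Minimal stand-in for the crawler side: notifies every registered listener.
    static final class EventBus {
        private final List<CrawlListener> listeners = new ArrayList<>();

        void register(CrawlListener listener) {
            listeners.add(listener);
        }

        void fire(String type, String url) {
            CrawlEvent event = new CrawlEvent(type, url);
            for (CrawlListener listener : listeners) {
                listener.onEvent(event);
            }
        }
    }

    public static void main(String[] args) {
        EventBus bus = new EventBus();
        CountingReport report = new CountingReport();
        bus.register(report);

        // Simulated crawl events (event names are made up for the example).
        bus.fire("DOCUMENT_FETCHED", "http://www.example.com/");
        bus.fire("DOCUMENT_FETCHED", "http://www.example.com/about");
        bus.fire("REJECTED_NOTFOUND", "http://www.example.com/missing");

        System.out.println(report.counts);
    }
}
```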
Can log every document failure with the exact cause of failure.
Uses log4j to control log level.
Logs are meaningful and verbose
There is more than one way to do something. Experiment and have fun!