Norconex Web Crawler

Getting Started

Command Line Usage

usage: collector-http[.bat|.sh]
 -a,--action <arg>      Required: one of start|resume|stop
 -c,--config <arg>      Required: Web Crawler configuration file.
 -v,--variables <arg>   Optional: variable file.
 -k,--checkcfg          Validates XML configuration. When combined with
                        -a, prevents execution on configuration error.

The above Web Crawler startup script is found in the root directory of your installation (where you extracted the downloaded Zip). Refer to the Flow Diagram and Configuration pages for documentation on all configuration options, and to the ConfigurationLoader Javadoc for details on the optional variables file.
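
For instance, based on the options above, you can validate a configuration without launching a crawl, or start one with a variables file (file names below are illustrative):

# Validate the configuration only:
collector-http[.bat|.sh] -k -c examples/minimum/minimum-config.xml

# Start a crawl, resolving placeholders from a variables file:
collector-http[.bat|.sh] -a start -c myconfig.xml -v myconfig.variables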

Working Examples

You may prefer to start with a working example and modify it to fit your needs. Two such examples ship with the distribution for testing your Norconex Web Crawler installation: the first is a very basic configuration, and the second shows how you can share configuration fragments.

To run the tests, go to the root directory where you uncompressed the distribution zip and run the following in a command-line console.

# Minimal test:
collector-http[.bat|.sh] -a start -c examples/minimum/minimum-config.xml
 
# Complex test:
collector-http[.bat|.sh] -a start -c examples/complex/complex-config.xml

Look for a directory called examples-output. It contains a sub-directory matching each test you ran, and within each, a directory called crawledFiles holding the collected data. You need Internet access to run the examples as-is; modify the example files if you wish to run them against a private network.
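
For instance, pointing the start URLs of an example configuration at an internal server is typically all that is needed. A sketch of the fragment to edit, assuming the tag names used by the bundled example files (verify against your copy):

<startURLs>
  <url>http://intranet.example.local/</url>
</startURLs>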

Java Integration

If you are using Maven, simply add the project dependency to your pom.xml (a coordinates sketch follows the sample code below). If you are not using Maven, add all the JAR files found in the "lib" folder of your installation to your application classpath. Configure the HttpCollector class by passing it an HttpCollectorConfig. You can build the configuration in Java, or load it from an XML configuration file using the CollectorConfigLoader class. Below is a sample usage:

/* XML configuration: */
//HttpCollectorConfig collectorConfig = (HttpCollectorConfig)
//        new CollectorConfigLoader(HttpCollectorConfig.class)
//                .loadCollectorConfig(myXMLFile, myVariableFile);
 
/* Java configuration: */
HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
collectorConfig.setId("MyHttpCollector");
collectorConfig.setLogsDir("/tmp/logs/");
...
HttpCrawlerConfig crawlerConfig = new HttpCrawlerConfig();
crawlerConfig.setId("MyHttpCrawler");
crawlerConfig.setStartURLs("http://example1.com", "http://example2.com");
...
collectorConfig.setCrawlerConfigs(crawlerConfig);
 
HttpCollector collector = new HttpCollector(collectorConfig);
collector.start(true);
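
The boolean passed to start(...) indicates whether an incomplete previous execution should be resumed; refer to the HttpCollector Javadoc to confirm the exact semantics for your version.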

Refer to the Web Crawler Javadoc for more documentation or the Configuration page for XML configuration options.
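
If you go the Maven route, the dependency declaration looks like the following sketch (artifact coordinates as published for the HTTP Collector; substitute the version matching your download):

<dependency>
  <groupId>com.norconex.collectors</groupId>
  <artifactId>norconex-collector-http</artifactId>
  <version>x.y.z</version>
</dependency>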

Extend the Web Crawler

To create your own feature implementations, create a new Java project in your favorite IDE. Use Maven, or add all the files contained in the lib folder of the Web Crawler installation to your classpath. Configure your project so that its binary output directory is the classes folder of the installation directory. Code compiled into the classes folder is automatically picked up by the Web Crawler when you run it.
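
As a sketch of what such an implementation can look like, here is a hypothetical reference filter assuming the IReferenceFilter extension point from the Collector Core API; the package, class name, and filtering rule are illustrative:

package com.example.crawler;

import com.norconex.collector.core.filter.IReferenceFilter;

/**
 * Hypothetical filter that skips any URL containing "/archive/".
 * Compile it into the "classes" folder of the installation and
 * reference the class from your XML configuration to activate it.
 */
public class NoArchiveReferenceFilter implements IReferenceFilter {
    @Override
    public boolean acceptReference(String reference) {
        // Accept everything except archive pages.
        return !reference.contains("/archive/");
    }
}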