usage: collector-http[.bat|.sh]
 -a,--action <arg>      Required: one of start|resume|stop|checkcfg
 -c,--config <arg>      Required: Web Crawler configuration file.
 -v,--variables <arg>   Optional: variable file.
 -k,--checkcfg          Validates XML configuration. When combined with -a,
                        prevents execution on configuration error.
The above Web Crawler startup script is found in the root directory of your installation (where you extracted the zip file you downloaded). Refer to the Flow Diagram and Configuration pages for documentation on all configuration options. Refer to the ConfigurationLoader Javadoc for details on the optional variables file.
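A variables file is a plain list of key/value pairs whose values can be referenced from the XML configuration using ${...} notation. Here is a brief sketch, with hypothetical file names and values (the ConfigurationLoader Javadoc has the authoritative rules):

# my-variables.properties (hypothetical)
workdir = /tmp/mycrawler
startURL = http://example.com

<!-- Referenced from the XML configuration: -->
<logsDir>${workdir}/logs</logsDir>

# Passing the file at startup:
collector-http.sh -a start -c my-config.xml -v my-variables.properties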
You may prefer to start with a working example and modify it to fit your needs. Two such examples are provided in the examples directory for testing your Norconex Web Crawler installation. The first represents a very basic configuration; the second shows how you can share configuration fragments.
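On fragment sharing: configuration files loaded through the ConfigurationLoader are processed as Apache Velocity templates, so a fragment shared by several configurations can be pulled in with Velocity's #parse directive. A minimal sketch with hypothetical file names (the bundled complex example shows the real usage):

## In your main configuration file:
#parse("shared-fragment.xml")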
To run the tests, go to the root directory where you uncompressed the distribution zip and run the following in a command-line console:

# Minimal test:
collector-http[.bat|.sh] -a start -c examples/minimum/minimum-config.xml

# Complex test:
collector-http[.bat|.sh] -a start -c examples/complex/complex-config.xml
Look for a directory called examples-output. You will find sub-directories matching the test you ran. In them, you will find a directory called crawledFiles containing the collected data.
You need Internet access to run the examples. Modify the example files
if you wish to run them on a private network.
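For instance, assuming the 2.x XML syntax used by the bundled examples (the URL below is a placeholder), you could point the start URL at a host reachable on your network:

<startURLs>
  <url>http://intranet.example.local/</url>
</startURLs>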
If you are using Maven, simply add the project dependency to your pom.xml.
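For example, assuming the 2.x artifact coordinates published on Maven Central (the version shown is only a placeholder; use the one matching your download):

<dependency>
  <groupId>com.norconex.collectors</groupId>
  <artifactId>norconex-collector-http</artifactId>
  <version>2.9.1</version> <!-- use your target version -->
</dependency>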
If you are not using Maven, you can add all JAR files found in your installation's "lib" folder to your application classpath. Configure the HttpCollector class by passing it an HttpCollectorConfig instance. You can build the configuration in Java or by loading an XML configuration file using the CollectorConfigLoader class. Below is a sample usage:
// Typical imports for the 2.x API (verify package names against
// your version's Javadoc):
import com.norconex.collector.core.CollectorConfigLoader;
import com.norconex.collector.http.HttpCollector;
import com.norconex.collector.http.HttpCollectorConfig;
import com.norconex.collector.http.crawler.HttpCrawlerConfig;

/* XML configuration: */
//HttpCollectorConfig collectorConfig = (HttpCollectorConfig)
//        new CollectorConfigLoader(HttpCollectorConfig.class)
//                .loadCollectorConfig(myXMLFile, myVariableFile);

/* Java configuration: */
HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
collectorConfig.setId("MyHttpCollector");
collectorConfig.setLogsDir("/tmp/logs/");
// ... other collector settings ...

HttpCrawlerConfig crawlerConfig = new HttpCrawlerConfig();
crawlerConfig.setId("MyHttpCrawler");
crawlerConfig.setStartURLs("http://example1.com", "http://example2.com");
// ... other crawler settings ...

collectorConfig.setCrawlerConfigs(crawlerConfig);

HttpCollector collector = new HttpCollector(collectorConfig);
collector.start(true); // true: resume any incomplete previous run
Refer to the Web Crawler Javadoc for more documentation or the Configuration page for XML configuration options.
To create your own feature implementations, create a new Java project in your favorite IDE. Use Maven or add to your classpath all the files contained in the lib folder of the Web Crawler installation. Configure your project so that its binary output directory is the classes folder of the installation directory. Code compiled into classes will automatically be picked up by the Web Crawler when you run it.
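To illustrate, the sketch below implements a custom reference filter. It assumes the 2.x com.norconex.collector.core.filter.IReferenceFilter interface, whose acceptReference(String) method decides whether a URL is kept; the class name and filtering rule are hypothetical, so verify the exact contract in your version's Javadoc:

package com.example.crawler; // hypothetical package

import com.norconex.collector.core.filter.IReferenceFilter;

// Hypothetical filter: skips any URL that points to a PDF file.
public class NoPdfReferenceFilter implements IReferenceFilter {
    @Override
    public boolean acceptReference(String reference) {
        // Accept everything except references ending in ".pdf".
        return !reference.toLowerCase().endsWith(".pdf");
    }
}

Once compiled into the installation classes folder, reference the fully qualified class name from the matching filter element of your XML configuration.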