# Getting Started
## Quick Start
The fastest way to get going is to start with a working example and modify it
to fit your needs. Two such examples are provided for testing with
your Norconex Web Crawler installation.
The first one represents a very basic
configuration and the second shows how you can share configuration fragments.
> **NOTE:** You need an active Internet connection to perform the following examples.
First, download the latest ZIP release of Web Crawler (version 3.0.0 or higher).
To install, simply uncompress the downloaded ZIP file into a directory of your choice.
If you are not in a terminal already, open one and change directory to where you just uncompressed Web Crawler.
Launch the crawler with the "minimum" configuration example.

**Linux:**

```shell
collector-http.sh start -config=examples/minimum/minimum-config.xml
```

**Windows:**

```bat
collector-http.bat start -config=examples\minimum\minimum-config.xml
```
Observe the execution log entries appearing on screen until the crawl finishes.
Congratulations, you have just run your first Web Crawler crawl!
Now launch the Collector again, this time with the "complex" example.

**Linux:**

```shell
collector-http.sh start -config=examples/complex/complex-config.xml
```

**Windows:**

```bat
collector-http.bat start -config=examples\complex\complex-config.xml
```
You should see similar log output in your console.
Assuming all went as planned, you should find a new directory named
`examples-output`. Navigate through it and its sub-directories until you find
files ending with `.xml`. Open these files to observe the content and metadata
fields that were fetched by the crawler, converted to XML format.
Have a look at the example configuration files to start familiarizing yourself with their formats and options.
### What now?
You can now jump directly to the documentation section of your choice and start configuring your own crawl. Have a look at the other Web Crawler reference material available to you, where you will find links to detailed class documentation (Javadoc) as well as a simple configuration "starter". You may not need it while testing, but eventually consider using one of the available Committers for storing crawled content. Finally, you can keep reading to become a master of Norconex Web Crawler.
## Command Line Usage
> **NOTE:** Windows users need to replace `.sh` with `.bat`.
```
Usage: collector-http.sh [-hv] [COMMAND]
Options:
  -h, -help      Show usage help message and exit
  -v, -version   Show the Collector version and exit
Commands:
  help           Displays help information about the specified command
  start          Start the Collector
  stop           Stop the Collector
  configcheck    Validate configuration file syntax
  configrender   Render effective configuration
  clean          Clean the Collector crawling history (to start fresh)
  storeexport    Export crawl store to specified directory
  storeimport    Import crawl store from specified files
```
Examples:

Start the Collector:

```shell
collector-http.sh start -config=/path/to/config.xml
```

Stop the Collector:

```shell
collector-http.sh stop -config=/path/to/config.xml
```

Get usage help on the "configcheck" command:

```shell
collector-http.sh help configcheck
```
The above `collector-http.sh` (or `collector-http.bat` on Windows) startup script is found in the root directory of your installation (where you extracted the ZIP you downloaded).
## Java Usage
Programmers can also configure and launch the Web Crawler from within their Java application. There are different ways to go about it, depending on how much you want to configure in Java versus XML. Here are three scenarios to help get you started:
**Simulate launching from the command line with an XML file:**

```java
String[] args = new String[] { "start", "-config=/path/to/config.xml" };
new CollectorCommandLauncher().launch(new HttpCollector(), args);
```
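Since the launcher mimics the command line, the other commands documented under Command Line Usage should work the same way. For instance, here is a sketch that validates a configuration file instead of starting a crawl (assuming `configcheck` accepts the same `-config=` argument as `start`):

```java
// Run the "configcheck" command through the same launcher used for
// "start" above; the configuration path is a placeholder.
String[] args = new String[] { "configcheck", "-config=/path/to/config.xml" };
new CollectorCommandLauncher().launch(new HttpCollector(), args);
```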
**Configure and launch 100% programmatically:**

```java
HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
collectorConfig.setId("My Collector Config");
collectorConfig.setWorkDir(Paths.get("/some/optional/path/to/workdir"));
// ...

HttpCrawlerConfig crawlerConfig = new HttpCrawlerConfig();
crawlerConfig.setId("My Crawler Config");
crawlerConfig.setStartURLs("http://example.com/");
// ...

collectorConfig.setCrawlerConfigs(crawlerConfig);
HttpCollector collector = new HttpCollector(collectorConfig);
collector.start();
```
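The `// ...` placeholders above are where you would typically set additional options. As one example, here is a hedged sketch of assigning a Committer to the crawler configuration; it assumes the `XMLFileCommitter` from the Norconex Committer Core library is on the classpath, and that the crawler config exposes a `setCommitters(...)` mutator matching the `getCommitters()` accessor used further below:

```java
// ASSUMPTIONS: XMLFileCommitter comes from the separate Norconex
// Committer Core library, and setCommitters(...) is the mutator
// counterpart of the getCommitters() accessor shown further below.
// This committer writes each crawled document to local XML files.
crawlerConfig.setCommitters(new XMLFileCommitter());
```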
**Hybrid XML/programmatic launch:**

```java
HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
Path configFile = Paths.get("/path/to/config.xml");
new ConfigurationLoader().loadFromXML(configFile, collectorConfig);
// collectorConfig.set... <-- Any config modification you want to make
// ...

HttpCollector collector = new HttpCollector(collectorConfig);
collector.start();
return collector;
```
There may be times when you want to access specific collector objects before launching or after execution is done. Using the `HttpCollector` object created just above, the following example shows a way to get the first Committer from your first crawler configuration.
```java
return collector.getCrawlers().get(0).getCrawlerConfig().getCommitters().get(0);
```
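Putting these pieces together, the following sketch uses only the calls already shown above (imports omitted, as in the other examples) to load an XML configuration, run the crawl, and then print the first Committer of the first crawler:

```java
// Load the XML configuration into a programmatic config object.
HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
new ConfigurationLoader().loadFromXML(
        Paths.get("/path/to/config.xml"), collectorConfig);

// Launch the crawl.
HttpCollector collector = new HttpCollector(collectorConfig);
collector.start();

// Collector objects remain accessible once execution is done:
System.out.println("First committer: "
        + collector.getCrawlers().get(0)
                .getCrawlerConfig().getCommitters().get(0));
```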
### Maven

Maven users can find the dependencies to add to their `pom.xml` in the download section of the Web Crawler website.