# Getting Started

# Quick Start

The fastest way to get going is to start with a working example and modify it to fit your needs. Two such examples are provided for testing your Norconex Web Crawler installation: the first is a very basic configuration, and the second shows how you can share configuration fragments.

NOTE

You need an active Internet connection to run the following examples.

  1. First, download the latest Zip file version of Web Crawler (version 3.0.0 or higher).

  2. To install, simply uncompress the downloaded Zip file into a directory of your choice.

  3. If you are not in a terminal already, open one and change directory to where you just uncompressed Web Crawler.

  4. Launch the crawler with the "minimum" configuration example.

    Linux

    collector-http.sh start -config=examples/minimum/minimum-config.xml
    

    Windows

    collector-http.bat start -config=examples\minimum\minimum-config.xml
    

    Observe the execution log entries appearing on screen until it finishes.

    Congratulations, you have just run your first Web Crawler crawl!

  5. Now launch the Collector again, this time with the "complex" example.

    Linux

    collector-http.sh start -config=examples/complex/complex-config.xml
    

    Windows

    collector-http.bat start -config=examples\complex\complex-config.xml
    

    You should see similar log output in your console.

Assuming all went as planned, you should find a new directory named examples-output. Navigate through its sub-directories until you find files ending in .xml and open them. They hold the content and metadata fields fetched by the crawler, converted to XML format.

Have a look at the example configuration files to start familiarizing yourself with their formats and options.
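To give you an idea of what to expect, here is a sketch of a minimal configuration in the same general shape as the shipped examples. The id values, URL, and limits below are illustrative, not a copy of the bundled minimum-config.xml:

```xml
<httpcollector id="Minimal Test Collector">
  <!-- Where generated files go (crawl store, logs, etc.). -->
  <workDir>./examples-output/minimum</workDir>
  <crawlers>
    <crawler id="Minimal Test Crawler">
      <!-- Seed URL(s); stayOnDomain keeps the crawl from leaving the site. -->
      <startURLs stayOnDomain="true">
        <url>https://example.com/</url>
      </startURLs>
      <!-- Keep a test crawl small and fast. -->
      <maxDepth>1</maxDepth>
      <maxDocuments>10</maxDocuments>
    </crawler>
  </crawlers>
</httpcollector>
```

The shipped examples define more options than this; compare them against this skeleton as you read.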

# What now?

You can now jump directly to the documentation section of your choice and start configuring your own crawl. Have a look at the other Web Crawler reference material available to you. There you will find links to detailed class documentation (Javadoc) as well as a simple configuration "starter". You may not need it while testing, but eventually consider using one of the available Committers for storing crawled content. Finally, you can keep reading to become a master of Norconex Web Crawler.

# Command Line Usage

NOTE

Windows users need to replace .sh with .bat.

Usage: collector-http.sh [-hv] [COMMAND]

Options:
  -h, -help      Show usage help message and exit
  -v, -version   Show the Collector version and exit

Commands:
  help          Displays help information about the specified command
  start         Start the Collector
  stop          Stop the Collector
  configcheck   Validate configuration file syntax
  configrender  Render effective configuration
  clean         Clean the Collector crawling history (to start fresh)
  storeexport   Export crawl store to specified directory
  storeimport   Import crawl store from specified files

Examples:

  Start the Collector:

    collector-http.sh start -config=/path/to/config.xml

  Stop the Collector:

    collector-http.sh stop -config=/path/to/config.xml

  Get usage help on "configcheck" command:

    collector-http.sh help configcheck

The above collector-http.sh (or collector-http.bat on Windows) startup script is found in the root directory of your installation (where you extracted the ZIP you downloaded).
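As an illustration of how these commands fit together, a typical session might validate a configuration before launching a crawl. The session below reuses the "minimum" example path from the Quick Start; adapt it to your own configuration file:

```shell
# Check the configuration file syntax first; fix any reported errors.
collector-http.sh configcheck -config=examples/minimum/minimum-config.xml

# Optionally render the effective configuration to verify applied defaults.
collector-http.sh configrender -config=examples/minimum/minimum-config.xml

# Clear the crawling history to start fresh, then launch the crawl.
collector-http.sh clean -config=examples/minimum/minimum-config.xml
collector-http.sh start -config=examples/minimum/minimum-config.xml
```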

# Java Usage

Programmers can also configure and launch the Web Crawler from within their Java application. There are different ways to go about it, depending on how much you want to configure in Java versus XML. Here are three scenarios to get you started:

Simulate launching from command line with XML file

String[] args = new String[] { "start", "-config=/path/to/config.xml" };
new CollectorCommandLauncher().launch(new HttpCollector(), args);

Configure and launch 100% programmatically

HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
collectorConfig.setId("My Collector Config");
collectorConfig.setWorkDir(Paths.get("/some/optional/path/to/workdir"));
// ...
HttpCrawlerConfig crawlerConfig = new HttpCrawlerConfig();
crawlerConfig.setId("My Crawler Config");
crawlerConfig.setStartURLs("http://example.com/");
// ...
collectorConfig.setCrawlerConfigs(crawlerConfig);

HttpCollector collector = new HttpCollector(collectorConfig);
collector.start();

Hybrid XML/programmatic launch

HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
Path configFile = Paths.get("/path/to/config.xml");
new ConfigurationLoader().loadFromXML(configFile, collectorConfig);
// collectorConfig.set...  <-- Any config modification you want to make
// ...

HttpCollector collector = new HttpCollector(collectorConfig);
collector.start();
return collector;

There may be times when you want to access specific Collector objects before launching, or after execution is done. Using the HttpCollector object created just above, the following example shows one way to get the first Committer from your first crawler configuration.

return collector.getCrawlers().get(0).getCrawlerConfig().getCommitters().get(0);

# Maven

Maven users can find the dependencies to add to their pom.xml in the download section of the Web Crawler website.

Last Updated: 10/14/2021, 4:44:32 PM