# Installation
# Manual Setup
# Web Crawler
Download the latest Web Crawler version and extract it into a directory of your choice. The newly created directory is referred to as your "installation directory" (or install_dir).
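As a minimal sketch, downloading and extracting on Linux might look like the following (the URL, file name, and version below are placeholders, not the actual download location):

```
# download and extract the Web Crawler (URL and file name are placeholders)
wget https://example.com/norconex-collector-http-x.x.x.zip
unzip norconex-collector-http-x.x.x.zip -d /opt/

# /opt/norconex-collector-http-x.x.x/ is now your installation directory (install_dir)
```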
# Committer
In addition to the Web Crawler, you need a Committer to persist the crawled data wherever you want it stored. Refer to the list of available Committers. A few Committers are already part of your installation, but chances are you will need to download the Committer that best suits your needs.
When downloading a Committer, extract it into a different directory, execute its install.sh script (or install.bat on Windows), and follow the on-screen instructions.
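For illustration, installing a downloaded Committer might look like the sketch below (the Committer name and archive file are hypothetical; the installer typically prompts for your Web Crawler installation directory):

```
# extract the Committer into its own directory (file name is hypothetical)
unzip norconex-committer-solr-x.x.x.zip -d ~/committers/
cd ~/committers/norconex-committer-solr-x.x.x

# run the installer and follow the on-screen instructions
./install.sh
```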
# Crawler Directory Structure
At a minimum, you should find the following directories and files in your installation directory:
.
├─ apidocs/
├─ classes/
├─ examples/
├─ lib/
├─ scripts/
├─ collector-http.bat
├─ collector-http.sh
└─ log4j2.xml
# apidocs/
Offline version of the Java API documentation (JavaDoc) matching the version of the Web Crawler you downloaded. This generated documentation is not only meant for developers. You will find each configurable class well documented, including XML configuration syntax and options, as well as usage samples. With the possible exception of the version it represents, it is the same as the online API documentation.
# classes/
This folder is for Java developers wishing to create their own custom features. You can put your custom Java classes in this folder and they will automatically be picked up on the next execution (provided you are using the supplied .bat or .sh launch script).
If you package your classes into a Jar file, put that file in the "lib" folder instead.
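As a hypothetical illustration, compiling and dropping in a custom class could look like this (the class name and paths are made up; install_dir stands for your installation directory):

```
# compile a custom class against the crawler libraries (class and paths are hypothetical)
javac -cp "install_dir/lib/*" -d install_dir/classes/ com/example/MyCustomFilter.java

# or, if your classes are packaged as a Jar, put the Jar in lib/ instead
cp my-custom-filter.jar install_dir/lib/
```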
# examples/
To help you get started with the Web Crawler, this folder holds example configuration files that can be used as-is to crawl pre-defined test pages. You will also find configuration reference files detailing many of the available XML options.
To run the examples, refer to Getting Started.
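As a quick illustration, launching one of the bundled examples from the installation directory might look like the sketch below (the example path and the -a/-c arguments are assumptions based on the 2.x launch script; see Getting Started for the exact command for your version):

```
# run an example configuration (paths and arguments may differ in your version)
./collector-http.sh -a start -c examples/minimum/minimum-config.xml
```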
# lib/
This directory holds the Java libraries making up the Norconex Web Crawler product. It contains a mix of Norconex-produced libraries as well as necessary third-party dependencies (all compatible with the open-source Apache License 2.0).
This is also where libraries get added when you install a Committer.
# scripts/
Utility scripts that are only needed when explicitly referenced in specific feature documentation. They are not required to run the Web Crawler.
# collector-http.bat
Launch script used to execute the Collector and its Web Crawler(s) on Windows.
# collector-http.sh
Launch script used to execute the Collector and its Web Crawler(s) on Linux and other compatible systems.
# log4j2.xml
Configuration file for the default logging system used by the Web Crawler. Many logging options are readily available in the file. You can change the log-level for those as you see fit. For more information, refer to the Log4j 2 Manual.
To use a log4j2 file from a different location, modify the value of the -Dlog4j.configurationFile Java system property in the collector-http.sh script (or collector-http.bat on Windows) to match your new location.
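For example, you could locate the property in the script and point it to a file kept outside the installation directory (the path below is hypothetical; -Dlog4j.configurationFile is the standard Log4j 2 system property):

```
# locate the Log4j 2 property inside the launch script
grep -n "log4j.configurationFile" collector-http.sh

# then edit its value to point to your own file, e.g.:
#   -Dlog4j.configurationFile="file:/opt/crawler-config/log4j2.xml"
```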
# Your Own Directory Structure
The Norconex Web Crawler only needs to be installed once, no matter how many sites you are crawling. If you need to crawl different websites requiring different configuration options, you will likely end up with multiple configuration files. In order to update the Web Crawler installation without affecting your configuration, it is recommended that you store your configuration files outside the Web Crawler installation directory.
Choosing a separate root directory for your configuration is a good idea to emphasize that the same Web Crawler installation can be shared by any configuration instance you create. In other words, you can have many concurrent Norconex Web Crawler instances running, each with its own configuration, but all sharing the same binaries (no need for multiple installs unless you are testing different versions).
There are a multitude of ways you can go about this. The following example ensures all generated files go into the workdir directory while configurations are kept separate for each Crawler, except for shared fragments (a sample launch command follows the listing).
.
├─ norconex-collector-x.x.x/
├─ configs/
│  ├─ collectorA/
│  │  ├─ configA.xml
│  │  └─ configA.variables
│  ├─ collectorB/
│  │  └─ configB.xml
│  └─ fragments/
│     ├─ shared-config1.xml
│     └─ shared-config2.xml
└─ workdir/
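With this layout, launching collectorA might look like the following sketch (the -a/-c/-v arguments are assumptions based on the 2.x launch script; adjust to your version):

```
# launch collectorA with its externally stored configuration and variables file
./norconex-collector-x.x.x/collector-http.sh -a start \
    -c configs/collectorA/configA.xml \
    -v configs/collectorA/configA.variables
```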
Another way to approach this is to keep generated files (i.e., workdir) alongside the configuration so that each Collector's files are more "self-contained":
.
├─ norconex-collector-x.x.x/
├─ collectorA/
│  ├─ config/
│  │  ├─ configA.xml
│  │  └─ configA.variables
│  └─ workdirA/
├─ collectorB/
│  ├─ config/
│  │  ├─ configB.xml
│  │  └─ configB.variables
│  └─ workdirB/
└─ fragments/
   ├─ shared-config1.xml
   └─ shared-config2.xml
# Docker
TODO