# Installation

# Manual Setup

HTTP Collector

Download the latest HTTP Collector version and extract it into a directory of your choice. The new directory created is referred to as your "installation directory" (or install_dir).

Committer

In addition to the HTTP Collector, you need a Committer that will persist the crawled data where you want to have it. Refer to the list of available Committers. You will find a few Committers already part of your installation, but chances are you need to download the Committer that best suits your needs.

When downloading a Committer, extract it into a different directory, execute the install.sh script (or install.bat on Windows), and follow the on-screen instructions.

# Collector Directory Structure

At a minimum, you should find the following directories and files in your installation directory:

.
├─ apidocs/
├─ classes/
├─ examples/
├─ lib/
├─ scripts/
├─ collector-http.bat
├─ collector-http.sh
└─ log4j2.xml

# apidocs/

Offline version of the Java API documentation (JavaDoc) matching the version of HTTP Collector you downloaded. This generated documentation is not only meant for developers. You will find each configurable class well documented, including XML configuration syntax and options, as well as usage samples. With the possible exception of the version it represents, it is the same as the online API documentation.

# classes/

This folder is for Java developers wishing to create their custom features. You can put your custom Java classes in this folder and they will be automatically picked up on the next execution (provided you are using the supplied .bat or .sh launch script).

If you package your classes into a Jar file, put that file in the "lib" folder instead.

# examples/

To help you get started with the HTTP Collector, this folder holds example configuration files that can be used as is to crawl pre-defined test pages. You will also find configuration reference files detailing many of the available XML options.

To run the examples, refer to Getting Started.

# lib/

This directory holds the Java libraries making up the HTTP Collector solution. It contains a mix of Norconex-produced libraries as well as necessary third-party dependencies (all compatible with the open-source Apache License 2.0).

This is also where libraries get added when you install a Committer.

# scripts/

Utility scripts you only need when explicitly referred to in specific feature documentation. Not needed to actually run the HTTP Collector.

# collector-http.bat

Launching script to execute the HTTP Collector on Windows.

# collector-http.sh

Launching script to execute the HTTP Collector on Linux and other compatible systems.

# log4j2.xml

Configuration file for the default logging system used by the HTTP Collector. Many logging options are readily available in the file. You can change the log-level for those as you see fit. For more information, refer to the Log4j 2 Manual.

To use a log4j2 file from a different location, modify the value of the -Dlog4j.configurationFile Java system property in the collector-http.sh script (or collector-http.bat on Windows) to match your new location.
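As a rough illustration of what that system property looks like once it points to a new location, here is a minimal sketch. The path and the variable names `LOG4J_CONFIG` and `LOG4J_OPT` are assumptions made for this example only; the actual line in your launch script may be written differently depending on the version you downloaded.

```shell
#!/bin/sh
# Hypothetical location of a logging configuration kept outside install_dir.
LOG4J_CONFIG="/opt/collector-configs/log4j2.xml"

# The JVM option as it would appear on the java command line
# inside the launch script (variable names are illustrative).
LOG4J_OPT="-Dlog4j.configurationFile=${LOG4J_CONFIG}"

# Print the option so you can compare it with the one in your launch script.
echo "$LOG4J_OPT"
```

Only the value after the equals sign needs to change; the property name itself stays the same.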

# Your Own Directory Structure

The HTTP Collector only needs to be installed once, no matter how many sites you are crawling. If you need to crawl different websites requiring different configuration options, you will likely end up with multiple configuration files. In order to update the HTTP Collector installation without affecting your configuration, it is recommended that you store your configuration files outside the HTTP Collector installation directory.

Choosing a separate root directory for your configuration is a good idea to emphasize that the same Collector installation can be shared by any configuration instance you create. In other words, you can have many concurrent Norconex Collector instances running, each having their own configuration, but all sharing the same binaries (no need to have multiple installs unless you are testing different versions).

There are a multitude of ways you can go about this. The following example ensures all generated files go into the workdir directory while configurations are kept separate for each Collector, except for shared fragments.

.
├─ norconex-collector-x.x.x/
├─ configs/
│  ├─ collectorA/
│  │  ├─ configA.xml
│  │  └─ configA.variables
│  ├─ collectorB/
│  │  └─ configB.xml
│  └─ fragments/
│     ├─ shared-config1.xml
│     └─ shared-config2.xml
└─ workdir/
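
The directories in the layout above could be created with a few commands. This is only a sketch: the `COLLECTOR_ROOT` variable and the `collector-root` default are assumptions chosen for illustration, and the configuration files themselves still have to be written by you.

```shell
#!/bin/sh
# Root under which configurations and generated files will live.
# COLLECTOR_ROOT and its default are illustrative, not part of the product.
ROOT="${COLLECTOR_ROOT:-$PWD/collector-root}"

# One configuration directory per Collector, plus shared fragments.
mkdir -p "$ROOT/configs/collectorA" \
         "$ROOT/configs/collectorB" \
         "$ROOT/configs/fragments"

# A single working directory receiving all generated files.
mkdir -p "$ROOT/workdir"

# Show the directories that now exist.
find "$ROOT" -type d | sort
```

The extracted norconex-collector-x.x.x directory can then sit next to `configs/` and `workdir/`, or anywhere else, since it is shared by every configuration.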

Another way to approach this is to keep generated files (i.e. workdir) beside the configuration so that each Collector's files are more "self-contained":

.
├─ norconex-collector-x.x.x/
├─ collectorA/
│  ├─ config/
│  │  ├─ configA.xml
│  │  └─ configA.variables
│  └─ workdirA/
├─ collectorB/
│  ├─ config/
│  │  ├─ configB.xml
│  │  └─ configB.variables
│  └─ workdirB/
└─ fragments/
   ├─ shared-config1.xml
   └─ shared-config2.xml
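
For this self-contained variant, a short loop can create one directory set per Collector. Again, this is a sketch under assumptions: `COLLECTOR_ROOT`, the `collector-root` default, and the `A`/`B` suffixes simply mirror the example names used above.

```shell
#!/bin/sh
# Root holding one self-contained directory per Collector.
# COLLECTOR_ROOT and its default are illustrative, not part of the product.
ROOT="${COLLECTOR_ROOT:-$PWD/collector-root}"

# Each Collector gets its own config/ and its own working directory.
for name in A B; do
  mkdir -p "$ROOT/collector$name/config" \
           "$ROOT/collector$name/workdir$name"
done

# Fragments shared by all Collectors stay in a common directory.
mkdir -p "$ROOT/fragments"

# Show the directories that now exist.
find "$ROOT" -type d | sort
```

Adding a third Collector then only means extending the list in the `for` loop.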

# Docker

TODO

Last Updated: 10/10/2020, 12:38:26 AM