# XML Configuration

One of the important goals of Norconex Crawlers is to make their configuration as flexible and portable as possible while keeping it manageable by mere mortals. This is achieved by offering an enhanced XML-based configuration. We call it enhanced because it extends standard XML with extra features such as variable resolution, configuration fragment reuse, and more.

TIP

Programmers can also configure the Web Crawler using Java.

# Classes

The Web Crawler can be extended in many ways by plugging in the functionality you need. This is done through the use of classes. Most configurable XML elements take a class attribute that specifies which feature to use or configure. To programmers, those correspond to Java classes. Each class offers its own set of configuration options, which are documented separately in the generated JavaDoc. For example:

<handler class="ConstantTagger">
  <constant name="category">finance</constant>
  <constant name="language">en</constant>
</handler>
<handler class="CurrentDateTagger" 
    toField="date_crawled"
    format="yyyy-MM-dd"/>

The above snippet illustrates the use of configurable elements named "handlers" from the Importer module. Each handler configures a different class: one enriches a document with constant values and the other adds the date it was crawled. You can see that each handler defines its class the same way, but otherwise expects a different configuration syntax. The ConstantTagger expects nested elements to define its constants, while the CurrentDateTagger uses a few in-line attributes with no body elements. All configurable options for these two classes are described in their corresponding JavaDoc.

# Name Resolution

As stated earlier, class attributes reference Java classes. Programmers are used to seeing a class name such as SomeClass accompanied by its package declaration, such as com.myorg.app.SomeClass, to avoid name conflicts. You can use either the short or the long version of the name in your XML configuration to reference classes. In fact, a partial package structure can be specified as well. For example, all of these are equivalent:

<handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger">...</handler>
<handler class="handler.tagger.impl.ConstantTagger">...</handler>
<handler class="ConstantTagger">...</handler>

Use whatever form you are more comfortable with. For instance, you can use the shorter form for an easier read, while reserving the longer form in case of name conflicts.
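
One way to picture this equivalence is as a suffix match on the fully qualified class name. The Java sketch below is only an illustration under that assumption. ClassNameMatchSketch and matches are hypothetical names, not part of the crawler API, and the actual lookup may differ.

```java
/** Hypothetical illustration; not the crawler's actual class lookup. */
final class ClassNameMatchSketch {

    /**
     * Returns true when the reference is the full class name, or a trailing
     * portion of it that starts on a package boundary (a dot).
     */
    static boolean matches(String fullyQualifiedName, String reference) {
        return fullyQualifiedName.equals(reference)
                || fullyQualifiedName.endsWith("." + reference);
    }

    public static void main(String[] args) {
        String fqcn = "com.norconex.importer.handler.tagger.impl.ConstantTagger";
        System.out.println(matches(fqcn, fqcn));                                 // true
        System.out.println(matches(fqcn, "handler.tagger.impl.ConstantTagger")); // true
        System.out.println(matches(fqcn, "ConstantTagger"));                     // true
        System.out.println(matches(fqcn, "antTagger"));                          // false
    }
}
```

Under this model, two classes sharing the same short name in different packages would both match the short form, which is why the longer form is worth keeping for name conflicts.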

# Default Classes

Some configurable elements assume default classes (documented as such). In such cases, you can omit the class attribute altogether. The <urlNormalizer> element is one such example, having GenericURLNormalizer as its default class. These are equivalent:

<urlNormalizer class="GenericURLNormalizer">...</urlNormalizer>
<urlNormalizer>...</urlNormalizer>

When applicable, you can remove a default implementation by setting it to null. This is achieved by having a self-closed element without any attributes, like this:

<urlNormalizer/>

# Conventions

Unless explicitly stated otherwise in the documentation, you can assume the following when editing your XML configuration files.

# Boolean Values

Configurable elements taking a boolean value (i.e., true or false) are always false by default and can be omitted. For instance, the following two lines are equivalent:

<documentChecksummer disabled="false" />
<documentChecksummer />

# White Spaces

There is no guarantee that all white space will be preserved in an element's text. At a minimum, you can expect leading and trailing spaces to be trimmed. If preserving all white space is important, you can use the standard XML attribute xml:space="preserve" on your element:


<!-- Value set to "fruit": -->
<value>  fruit  </value>

<!-- Value set to "  fruit  ": -->
<value xml:space="preserve">  fruit  </value>

# Empty String vs null

An empty string can sometimes be considered a valid configuration value. In such cases, it may be important to distinguish between an empty string and a null value (i.e. no value).

Empty tags are interpreted as having an empty string. Self-closing tags have their value interpreted as null. Finally, non-existing tags have no effect (no value is set, using whatever default).

For example, let's pretend you have a configuration option with the default value apple. Depending on how you write your XML, you can expect different behaviors.

<!-- Value set to "orange": -->
<config>
  <value>orange</value>
</config>

<!-- Value set to "" (empty string): -->
<config>
  <value></value>
</config>

<!-- Value set to NULL: -->
<config>
  <value/>
</config>

<!-- Value not set (remains "apple"): -->
<config>
</config>

# Empty List

Self-closing or empty tags also influence how configuration sections expecting a list of elements are interpreted. For such lists, having a self-closing or empty parent tag indicates you explicitly do not want any item in the list, effectively clearing any defaults. Not defining the parent tag at all means you are happy with the defaults (if any). If there are no defaults for a list of items, then specifying a self-closing tag, an empty tag, or no tag at all has no effect.

For example, the Web Crawler default configuration has one extractor defined in its list of link extractors (HtmlLinkExtractor). Depending on how you configure it, you will get different outcomes:

<!-- Add an additional link extractor: -->
<crawler id="myCrawler">
  <linkExtractors>
    <extractor class="HtmlLinkExtractor"/>
    <extractor class="DOMLinkExtractor"/>
  </linkExtractors>
</crawler>

<!-- Replace the default link extractor: -->
<crawler id="myCrawler">
  <linkExtractors>
    <extractor class="DOMLinkExtractor"/>
  </linkExtractors>
</crawler>

<!-- 2 ways to remove the default link extractor: -->
<crawler id="myCrawler">
  <linkExtractors></linkExtractors>
</crawler>
<crawler id="myCrawler">
  <linkExtractors/>
</crawler>

<!-- Link extractor list not set, remains HtmlLinkExtractor: -->
<crawler id="myCrawler">
</crawler>

# Templating

Although they are referred to as XML, Crawler configuration files are in fact Apache Velocity templates that are interpreted by the Velocity engine (with a few added tweaks) before being applied to the crawler.

This is done transparently and won't affect how you write your XML unless you want it to. What it does offer is additional syntax options that make your configuration easier to write and maintain.

The next sections of this page describe how your configuration files can benefit from using Velocity syntax. We encourage you to experiment and read the Velocity user guide for more information.

# Variables

You can use variables in your configuration files. This is particularly useful for such things as:

  • Moving environment-specific settings out of your configuration files to make them more portable.
  • Improving the reusability of configuration fragments.
  • Facilitating maintenance by defining some values only once and referencing them in multiple locations.

# Notation

Variables are written as per the Velocity Template Language (VTL), which can be summarized as follows:

  • A variable starts with the dollar sign ($) followed by the variable name surrounded by curly braces ({ and }). Braces can be omitted if your variable name is not followed by other alphanumeric characters, but we recommend keeping them to ensure proper parsing. Example for a variable named numThreads:

    <numThreads>${numThreads}</numThreads>
    
  • You can specify default values for undefined variables with the vertical bar (|) as a separator. Example:

    <numThreads>${numThreads|'10'}</numThreads>
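
To make the effect of the default-value notation concrete, here is a minimal Java sketch that mimics it with a plain regular expression. It does not use Velocity and is not how the crawler resolves variables. DefaultValueSketch and resolve are hypothetical names, and leaving an unresolved variable untouched is an assumption made for the example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of ${name} and ${name|'default'} substitution. */
final class DefaultValueSketch {

    // Group 1: variable name. Group 2: optional single-quoted default value.
    private static final Pattern VAR =
            Pattern.compile("\\$\\{(\\w+)(?:\\|'([^']*)')?\\}");

    static String resolve(String text, Map<String, String> vars) {
        Matcher m = VAR.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = vars.get(m.group(1));
            if (value == null) {
                value = m.group(2);   // fall back to the default, if any
            }
            if (value == null) {
                value = m.group();    // assumption: leave unresolved text as-is
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<numThreads>${numThreads|'10'}</numThreads>";
        // Variable undefined: prints <numThreads>10</numThreads>
        System.out.println(resolve(xml, Map.of()));
        // Variable defined: prints <numThreads>4</numThreads>
        System.out.println(resolve(xml, Map.of("numThreads", "4")));
    }
}
```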
    

There are multiple ways to declare variables for use in your configuration files:

# In-line Directives

Variables can be declared directly in your configuration file using the Velocity set directive. Example:

 #set($pdfMatcher = '<valueMatcher>application/pdf</valueMatcher>')

The above example assumes a frequent use of a <valueMatcher> tag to match PDF content types. The $pdfMatcher variable can be referenced as often as needed in your configuration. Example:

<restrictTo>
  <fieldMatcher>document.contentType</fieldMatcher>
  ${pdfMatcher}
</restrictTo>

# Environment Variables

Your configuration variables can refer to environment variables of the same name or a supported variant. For each configuration variable, the following logic applies to resolve it against an environment variable:

  • It first looks for an environment variable with the exact same name.
  • If no match is found, it iterates through all environment variable names and compares each one with the requested name, after stripping all non-alphanumeric characters and ignoring case.

As an example, a configuration variable ${hostName} can successfully be resolved against any of these environment variables:

  • hostName
  • hostname
  • HOST_NAME
  • Host-Name
  • host.name
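
The two-step lookup described above can be sketched in Java as follows. This is an illustration, not the crawler's implementation. EnvNameMatcher, normalize, and resolve are hypothetical names, and treating only ASCII letters and digits as alphanumeric is an assumption.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper illustrating exact-then-relaxed name matching. */
final class EnvNameMatcher {

    /** Strips every non-alphanumeric character and lower-cases the rest. */
    static String normalize(String name) {
        return name.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
    }

    /** Tries an exact match first, then compares normalized names. */
    static Optional<String> resolve(String variable, Map<String, String> env) {
        if (env.containsKey(variable)) {
            return Optional.ofNullable(env.get(variable));
        }
        String wanted = normalize(variable);
        return env.entrySet().stream()
                .filter(e -> normalize(e.getKey()).equals(wanted))
                .map(Map.Entry::getValue)
                .findFirst();
    }

    public static void main(String[] args) {
        // A stand-in for System.getenv(), so the example is reproducible.
        Map<String, String> env = Map.of("HOST_NAME", "example.com");
        // Prints: example.com
        System.out.println(resolve("hostName", env).orElse("(unresolved)"));
    }
}
```

Passing System.getenv() as the map would apply the same logic to the real environment.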

NOTE

Environment variables are the same ones Java developers can obtain using System#getenv().

# System Properties

System properties are passed as arguments to the Java Virtual Machine (JVM) at launch time using the -D flag (e.g., -Dhost.name=example.com). They are otherwise resolved like environment variables (see previous section). When both are specified, system properties take precedence over environment variables.

NOTE

System properties are the same ones Java developers can obtain using System#getProperties().

# Variables File

Variables can also be declared in a file dedicated to that effect. A variables file is expected to end with .properties or .variables (each explained further down). Such a file can be passed as an argument at launch time or detected and loaded implicitly.

# As argument

To pass a variables file as an argument, add the -variables flag followed by the path to your variables file as a launch argument. The following assumes you created a different variables file per environment and you are specifying the production one:

collector-http.sh start -config=myconfig.xml -variables=env-prod.variables

NOTE

Windows users need to replace .sh with .bat.

# Implicit

To have a variables file loaded automatically, give it the same name as your configuration file with the exception of the file extension. This holds true for configuration fragments as well. For instance, given the following files, configA.xml and frag1.xml will both have their associated variables files automatically loaded (no need to pass them as command-line arguments):

.
├─ configA.xml
├─ configA.variables
├─ configB.xml
├─ fragments/
│  ├─ frag1.xml
│  ├─ frag1.properties
│  ├─ frag2.xml

TIP

Java developers can also specify a .variables or .properties file using the ConfigurationLoader#setVariablesFile(Path) method.
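
The implicit naming rule described in this section amounts to swapping the file extension of the configuration file. The Java sketch below only illustrates that rule. ImplicitVariablesFileSketch and candidates are hypothetical names, not crawler API, and the order of the returned list carries no meaning.

```java
import java.nio.file.Path;
import java.util.List;

/** Hypothetical illustration of the implicit variables file naming rule. */
final class ImplicitVariablesFileSketch {

    /**
     * Returns the sibling paths that share the configuration file's base name
     * but end with .variables or .properties. Assumes the given path has a
     * file name component.
     */
    static List<Path> candidates(Path config) {
        String fileName = config.getFileName().toString();
        int dot = fileName.lastIndexOf('.');
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;
        return List.of(
                config.resolveSibling(base + ".variables"),
                config.resolveSibling(base + ".properties"));
    }

    public static void main(String[] args) {
        // Lists the two files that would be looked up next to frag1.xml.
        candidates(Path.of("fragments", "frag1.xml")).forEach(System.out::println);
    }
}
```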

# Variables File Format

A variables file can have one of two possible formats, represented by the .variables or .properties extensions. Both share a key-value format, with the key being the variable name.

When both .variables and .properties files exist for a configuration file, the variables in the .properties file take precedence.

The two formats have minor differences:

# .variables

A .variables file must have keys and values separated by an equal sign and only supports one variable per line. The key and value strings are taken literally, after trimming leading and trailing spaces. Example:

numThreads = 5
host = example.com
port = 8983
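
The .variables rules above (one variable per line, an equal sign as separator, trimmed key and value) can be sketched in Java as follows. This is an illustration only. VariablesFileSketch and parse are hypothetical names, and skipping lines that have no equal sign or an empty key is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the .variables line format. */
final class VariablesFileSketch {

    static Map<String, String> parse(List<String> lines) {
        Map<String, String> vars = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // assumption: ignore lines without a separator
            }
            // Key and value are taken literally after trimming.
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            if (!key.isEmpty()) {
                vars.put(key, value);
            }
        }
        return vars;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "numThreads = 5", "host = example.com", "port = 8983");
        // Prints: {numThreads=5, host=example.com, port=8983}
        System.out.println(parse(lines));
    }
}
```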

# .properties

The syntax expected of a .properties file is defined by the Java programming language. It is essentially the same as the .variables format, but has more options (e.g., multi-line support) and gotchas (e.g., certain characters must be escaped). Please refer to the corresponding Java API documentation for the exact syntax and parsing logic. Example:

numThreads = 5
fileExtensions = pdf, html, doc, \
    docx, ppt, pptx, msg, png, \
    xml
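
To see how the multi-line value above is read, the following Java sketch loads the same text with the standard java.util.Properties class. It shows the standard Java parsing only, not how the crawler loads the file. PropertiesFormatSketch and load are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustration of standard .properties parsing with java.util.Properties. */
final class PropertiesFormatSketch {

    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // In Java source, "\\" is the single backslash that ends a line.
        String text = String.join("\n",
                "numThreads = 5",
                "fileExtensions = pdf, html, doc, \\",
                "    docx, ppt, pptx, msg, png, \\",
                "    xml");
        Properties props = load(text);
        // Prints: pdf, html, doc, docx, ppt, pptx, msg, png, xml
        System.out.println(props.getProperty("fileExtensions"));
    }
}
```

The continuation backslash joins the lines, and the leading white space of each continued line is dropped, so the value ends up on a single line.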

# Resolution Order

Variables defined in multiple locations are resolved in the following order of precedence:

  1. System Properties
  2. Environment Variables
  3. .properties file
  4. .variables file
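
The order of precedence above can be pictured as a lookup that stops at the first source defining the variable. The Java sketch below is a simplified illustration. ResolutionOrderSketch and resolve are hypothetical names, each source is modeled as a plain map, and names are matched exactly (the relaxed matching described for environment variables is left out).

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration of first-match-wins variable resolution. */
final class ResolutionOrderSketch {

    /** Sources must be listed from highest to lowest precedence. */
    static Optional<String> resolve(
            String name, List<Map<String, String>> sourcesByPrecedence) {
        for (Map<String, String> source : sourcesByPrecedence) {
            String value = source.get(name);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Map<String, String>> sources = List.of(
                Map.of("port", "9000"),                               // system properties
                Map.of("port", "8000", "host", "env.example.com"),    // environment variables
                Map.of("host", "props.example.com", "numThreads", "5"), // .properties file
                Map.of("numThreads", "2", "depth", "3"));             // .variables file

        System.out.println(resolve("port", sources).orElse(""));       // 9000
        System.out.println(resolve("host", sources).orElse(""));       // env.example.com
        System.out.println(resolve("numThreads", sources).orElse("")); // 5
        System.out.println(resolve("depth", sources).orElse(""));      // 3
    }
}
```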

# Configuration Fragments

To include configuration fragments and favor reuse, use the #include("myfile.cfg") or #parse("myfile.cfg") directives. An #include directive will include the referenced file as-is, without interpretation. A #parse directive will treat the included file as a Velocity template file and will interpret it (along with its variables file if one exists).

The paths of included/parsed files are relative to the parent template, or they can be absolute paths on the host where the configuration loader is executed. Both Windows and UNIX path styles are equally supported. The following is an example of using configuration fragments for dynamic inclusion.

Sample directory structure:

.
├─ configs/
│  ├─ myconfig.cfg
│  ├─ myconfig.properties
├─ fragments/
│  ├─ shared.cfg
│  ├─ shared.variables

Configuration file myconfig.cfg:

 <myconfig>
    <host>$host</host>
    <port>$port</port>
    #parse("../fragments/shared.cfg")
 </myconfig>

Explanation:

When loading myconfig.cfg, the variables defined in myconfig.properties are automatically loaded and will replace the $host and $port variables. The myconfig.cfg file is also parsing a shared configuration file: shared.cfg. That file will be parsed and inserted, with its variables defined in shared.variables automatically loaded and resolved.

Other Velocity directives are supported (if-else statements, foreach loops, macros, etc.). Refer to the Velocity User Guide for complete syntax and template documentation.

Last Updated: 10/12/2021, 2:52:12 AM