# XML Configuration
One of the important goals of Norconex Crawlers is to make their configuration as flexible and portable as possible while keeping it manageable by mere mortals. This is achieved by offering enhanced XML-based configuration. We call it enhanced because it extends standard XML with extra features such as variable resolution, configuration fragment reuse, and more.
TIP
Programmers can also configure the Web Crawler using Java.
# Classes
The Web Crawler can be extended in many ways by plugging in the functionality
you need. This is done through the use of classes. Most configurable
XML elements take a class attribute that specifies which feature to
use or configure. To programmers, those correspond to Java classes. Each
class offers its own set of configuration options, which are documented
separately via generated JavaDoc. For example:
<handler class="ConstantTagger">
<constant name="category">finance</constant>
<constant name="language">en</constant>
</handler>
<handler class="CurrentDateTagger"
toField="date_crawled"
format="yyyy-MM-dd"/>
The above snippet illustrates the use of configurable elements named
"handlers" from the Importer module. Each handler configures a different
class: one enriches a document with constant values and the other adds the
date it was crawled. You can see that each handler defines its class the
same way, but otherwise expects a different configuration syntax.
The ConstantTagger expects nested elements to define its constants, while
the CurrentDateTagger uses a few in-line attributes with no body elements.
All configurable options for these two classes are described in their
corresponding JavaDoc.
# Name Resolution
As stated earlier, classes reference Java classes. Programmers
are used to seeing SomeClass accompanied by its package declaration,
such as com.myorg.app.SomeClass, to avoid name conflicts. You can use either
the short or long version of the name in your XML configuration to reference
classes. In fact, a partial package structure can be specified as well.
For example, all of these are equivalent:
<handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger">...</handler>
<handler class="handler.tagger.impl.ConstantTagger">...</handler>
<handler class="ConstantTagger">...</handler>
Use whatever form you are more comfortable with. For instance, you can use the shorter form for an easier read, while reserving the longer form in case of name conflicts.
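To make the idea of short, partial, and full names more concrete, here is a minimal Java sketch of suffix-based matching against a list of known fully qualified class names. It is not the crawler's actual resolution code: the ClassNameMatcher class, its findMatches method, the matching rule, and the com.myorg.app.ConstantTagger class are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class ClassNameMatcher {

    private ClassNameMatcher() {}

    // Returns every known class whose fully qualified name equals the
    // requested name or ends with "." followed by the requested name.
    static List<String> findMatches(String requested, List<String> knownClasses) {
        List<String> matches = new ArrayList<>();
        for (String known : knownClasses) {
            if (known.equals(requested) || known.endsWith("." + requested)) {
                matches.add(known);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> known = Arrays.asList(
                "com.norconex.importer.handler.tagger.impl.ConstantTagger",
                "com.myorg.app.ConstantTagger"); // hypothetical conflicting class

        // The partial package structure identifies a single class.
        System.out.println(findMatches("handler.tagger.impl.ConstantTagger", known));

        // The short name matches both classes, which is the kind of
        // conflict the longer forms are meant to avoid.
        System.out.println(findMatches("ConstantTagger", known));
    }
}
```

In this sketch a short name that matches more than one class returns several candidates, which illustrates why the longer form is worth keeping in reserve.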
# Default Classes
Some configurable elements assume default classes (documented as such). In
such cases, you can omit the class attribute altogether. The
<urlNormalizer> element is one such example, having
GenericURLNormalizer as the default class. These are equivalent:
<urlNormalizer class="GenericURLNormalizer">...</urlNormalizer>
<urlNormalizer>...</urlNormalizer>
When applicable, you can remove a default implementation by setting it to
null. This is achieved by having a self-closed element without any
attributes, like this:
<urlNormalizer/>
# Conventions
Unless explicitly stated otherwise in the documentation, you can assume the following when editing your XML configuration files.
# Boolean Values
Configurable elements taking a boolean value (i.e., true or false) are
always false by default and can be omitted.
For instance, the following two lines are equivalent:
<documentChecksummer disabled="false" />
<documentChecksummer />
# White Spaces
There is no guarantee that all white spaces will be preserved in an element's text.
At a minimum, you can expect leading and trailing spaces to be trimmed.
If preserving all white spaces is important, you can use the standard
XML attribute xml:space="preserve" on your element:
<!-- Value set to "fruit": -->
<value> fruit </value>
<!-- Value set to " fruit ": -->
<value xml:space="preserve"> fruit </value>
# Empty String vs null
An empty string can sometimes be considered a valid configuration value. In
such cases, it may be important to distinguish between an empty string
and a null value (i.e., no value).
Empty tags are interpreted as having an empty string. Self-closing tags
have their value interpreted as null. Finally, non-existing tags
have no effect (no value is set, so whatever default exists is used).
For example, let's pretend you have a configuration option with the
default value apple. Depending on how you write your XML, you can
expect different behaviors.
<!-- Value set to "orange": -->
<config>
<value>orange</value>
</config>
<!-- Value set to "" (empty string): -->
<config>
<value></value>
</config>
<!-- Value set to NULL: -->
<config>
<value/>
</config>
<!-- Value not set (remains "apple"): -->
<config>
</config>
# Empty List
Self-closing or empty tags also influence how configuration sections expecting a list of elements are interpreted. For such lists, having a self-closing or empty parent tag indicates you explicitly do not want any item in the list, effectively clearing any defaults. Not defining the parent tag at all means you are happy with the defaults (if any). If there are no defaults for a list of items, then specifying a self-closing tag, an empty tag, or no tag at all has no effect.
For example, the Web Crawler default configuration has one extractor
defined in its list of link extractors (HtmlLinkExtractor).
Depending on how you configure it, you will get different outcomes:
<!-- Add an additional link extractor: -->
<crawler id="myCrawler">
<linkExtractors>
<extractor class="HtmlLinkExtractor"/>
<extractor class="DOMLinkExtractor"/>
</linkExtractors>
</crawler>
<!-- Replace the default link extractor: -->
<crawler id="myCrawler">
<linkExtractors>
<extractor class="DOMLinkExtractor"/>
</linkExtractors>
</crawler>
<!-- 2 ways to remove the default link extractor: -->
<crawler id="myCrawler">
<linkExtractors></linkExtractors>
</crawler>
<crawler id="myCrawler">
<linkExtractors/>
</crawler>
<!-- Link extractor list not set, remains HtmlLinkExtractor: -->
<crawler id="myCrawler">
</crawler>
# Templating
Although referred to as XML, Crawler configuration files are in fact Apache Velocity templates, interpreted by the Velocity engine (with a few added tweaks) before being applied to the crawler.
This is done transparently and does not affect how you write your XML unless you want it to. What it does offer is additional syntax options that make your configuration easier to write and maintain.
The next sections of this page describe how your configuration files can benefit from using Velocity syntax. We encourage you to experiment and read the Velocity user guide for more information.
# Variables
You can use variables in your configuration files. This is particularly useful for such things as:
- Moving out any configuration settings that are environment-specific to increase the portability of your configuration files.
- Improving the reusability of configuration fragments.
- Facilitating maintenance by defining some values only once but referencing them in multiple locations.
# Notation
Variables are written as per the Velocity Template Language (VTL), which can be summarized as follows:
- A variable starts with the dollar sign ($) followed by the variable name surrounded by curly braces ({ and }). Braces can be omitted if your variable name is not followed by other alphanumeric characters, but we recommend keeping them to ensure proper parsing. Example for a variable named numThreads: <numThreads>${numThreads}</numThreads>
- You can specify default values for undefined variables with the vertical bar (|) as a separator. Example: <numThreads>${numThreads|'10'}</numThreads>
There are multiple ways to declare variables for use in your configuration files:
# In-line Directives
Variables can be declared directly in your configuration file using the
Velocity #set directive. Example:
#set($pdfMatcher = '<valueMatcher>application/pdf</valueMatcher>')
The above example assumes frequent use of a <valueMatcher> tag to match
PDF content types. The $pdfMatcher variable can be referenced as often
as needed in your configuration. Example:
<restrictTo>
<fieldMatcher>document.contentType</fieldMatcher>
${pdfMatcher}
</restrictTo>
# Environment Variables
Your configuration variables can refer to environment variables of the same name or a supported variant. For each configuration variable, the following logic applies to resolve it against an environment variable:
- It first looks for an environment variable with the exact same name.
- If no match is found, it iterates through all environment variable names and compares each of them with the requested name, but only after stripping all non-alphanumeric characters and ignoring case.
As an example, a configuration variable ${hostName} can successfully
be resolved against any of these environment variables:
hostName
hostname
HOST_NAME
Host-Name
host.name
NOTE
Environment variables are the same ones Java developers can obtain using System#getenv().
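The two-step lookup described above can be sketched in Java as follows. This is an illustrative approximation rather than the crawler's own code: the EnvVarLookup class and its methods are hypothetical, the map stands in for what System.getenv() returns, and "alphanumeric" is assumed to mean ASCII letters and digits.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class EnvVarLookup {

    private EnvVarLookup() {}

    // Strips every character that is not an ASCII letter or digit
    // (an assumption) and lower-cases what remains.
    static String normalize(String name) {
        return name.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
    }

    // Step 1: exact name match. Step 2: compare normalized names.
    // Returns null when no environment variable matches.
    static String resolve(String name, Map<String, String> env) {
        if (env.containsKey(name)) {
            return env.get(name);
        }
        String wanted = normalize(name);
        for (Map.Entry<String, String> entry : env.entrySet()) {
            if (normalize(entry.getKey()).equals(wanted)) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for System.getenv(), so the example is reproducible.
        Map<String, String> env = new LinkedHashMap<>();
        env.put("HOST_NAME", "example.com");

        // ${hostName} has no exact match, so the normalized comparison is used.
        System.out.println(resolve("hostName", env));
    }
}
```

Passing System.getenv() instead of the hand-built map would apply the same logic to the real environment.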
# System Properties
System properties are arguments passed to the Java Virtual
Machine (JVM) at launch time using the -D flag
(e.g., -Dhost.name=example.com). They are otherwise resolved
like environment variables (see previous section). When both are specified,
system properties take precedence over environment variables.
NOTE
System properties are the same ones Java developers can obtain using System#getProperties().
# Variables File
Variables can also be declared in a file dedicated to that purpose. A variables
file is expected to end with .properties or .variables
(each explained further down). Such a file
can be passed as an argument at launch time or detected and loaded implicitly.
# As argument
To pass a variables file as an argument, add the -variables flag followed by
the path to your variables file as a launch argument.
The following assumes you created a different variables file per environment
and you are specifying the production one:
collector-http.sh start -config=myconfig.xml -variables=env-prod.variables
NOTE
Windows users need to replace .sh with .bat.
# Implicit
To have a variables file loaded automatically, give it the same name
as your configuration file, except for the file extension. This holds
true for configuration fragments as well. For instance, given the following
files, configA.xml and frag1.xml will both
have their associated variables files automatically loaded (no need to
pass them as command-line arguments):
.
├─ configA.xml
├─ configA.variables
├─ configB.xml
├─ fragments/
│ ├─ frag1.xml
│ ├─ frag1.properties
│ ├─ frag2.xml
TIP
Java developers can also specify a .variables or .properties file
using the ConfigurationLoader#setVariablesFile(Path) method.
# Variables File Format
A variables file can have one of two possible formats, represented by
the .variables or .properties extensions. Both share a key-value format,
with the key being the variable name.
When both .variables and .properties files exist for a configuration file,
the variables in the .properties file take precedence.
The two formats have minor differences:
# .variables
A .variables file must have keys and values separated by an equal sign and
only supports one variable per line. The key and value strings are taken
literally, after trimming leading and trailing spaces. Example:
numThreads = 5
host = example.com
port = 8983
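The .variables rules above (one variable per line, an equal sign as separator, trimmed keys and values) can be approximated with a few lines of Java. This is only a sketch, not the crawler's parser: the VariablesFileSketch class is hypothetical, and splitting on the first equal sign and skipping lines without one are assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class VariablesFileSketch {

    private VariablesFileSketch() {}

    // Parses one "key = value" pair per line, trimming both sides.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> variables = new LinkedHashMap<>();
        for (String line : lines) {
            int separator = line.indexOf('='); // assumption: first '=' separates key and value
            if (separator < 0) {
                continue; // assumption: lines without '=' are ignored
            }
            String key = line.substring(0, separator).trim();
            String value = line.substring(separator + 1).trim();
            if (!key.isEmpty()) {
                variables.put(key, value);
            }
        }
        return variables;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "numThreads = 5",
                "host = example.com",
                "port = 8983");
        // Prints each variable name with its trimmed value.
        parse(lines).forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```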
# .properties
The expected syntax of a .properties file is defined by the Java programming
language. It is essentially the same,
but has more options (e.g., multi-line support) and gotchas (e.g., certain
characters must be escaped). Please refer to the
corresponding Java API documentation
for exact syntax and parsing logic. Example:
numThreads = 5
fileExtensions = pdf, html, doc, \
docx, ppt, pptx, msg, png, \
xml
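To see how the multi-line value above is read, the following sketch loads the same content with the standard java.util.Properties class. The PropertiesFileSketch class name is hypothetical, and the content is loaded from a string rather than a file only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class PropertiesFileSketch {

    private PropertiesFileSketch() {}

    // Loads .properties content using the standard Java parser.
    static Properties load(String content) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    public static void main(String[] args) {
        // A trailing backslash continues the value on the next line;
        // leading white space on the continuation line is dropped.
        String content =
                "numThreads = 5\n"
              + "fileExtensions = pdf, html, doc, \\\n"
              + "    docx, ppt, pptx, msg, png, \\\n"
              + "    xml\n";
        Properties properties = load(content);
        // Prints: pdf, html, doc, docx, ppt, pptx, msg, png, xml
        System.out.println(properties.getProperty("fileExtensions"));
    }
}
```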
# Resolution Order
Variables defined in multiple locations are resolved in the following order of precedence:
- System Properties
- Environment Variables
- .properties file
- .variables file
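The order of precedence can be pictured as a chain of lookups that stops at the first source defining the variable. The Java sketch below is a simplified illustration, not the crawler's implementation: the VariableSources class is hypothetical, each source is modeled as a plain map, names are matched exactly (the looser environment variable matching is left out), and the fallback argument plays the role of a default such as the one in ${numThreads|'10'}.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class VariableSources {

    private VariableSources() {}

    // Walks the sources from highest to lowest precedence and returns the
    // first value found, or the default when no source defines the name.
    static String resolve(String name, List<Map<String, String>> sources, String defaultValue) {
        for (Map<String, String> source : sources) {
            String value = source.get(name);
            if (value != null) {
                return value;
            }
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        // Sample values, listed in the documented order of precedence.
        Map<String, String> systemProperties = Collections.singletonMap("host", "sys.example.com");
        Map<String, String> environment = Collections.singletonMap("host", "env.example.com");
        Map<String, String> propertiesFile = Collections.singletonMap("numThreads", "5");
        Map<String, String> variablesFile = Collections.singletonMap("numThreads", "2");

        List<Map<String, String>> sources =
                Arrays.asList(systemProperties, environment, propertiesFile, variablesFile);

        System.out.println(resolve("host", sources, null));       // system property wins
        System.out.println(resolve("numThreads", sources, "10")); // .properties beats .variables
        System.out.println(resolve("port", sources, "8983"));     // undefined, default is used
    }
}
```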
# Configuration Fragments
To include configuration fragments and favor reuse, use
the #include("myfile.cfg") or #parse("myfile.cfg") directives. An #include
directive will include the referenced file as-is, without interpretation.
A #parse directive will treat the included file as a Velocity template file
and will interpret it (along with its variables file if one exists).
The paths of included/parsed files are relative to the parent template, or they can be absolute paths on the host where the configuration loader is executed. Both Windows and UNIX path styles are equally supported. The following is an example of using configuration fragments for dynamic inclusion.
Sample directory structure:
.
├─ configs/
│ ├─ myconfig.cfg
│ ├─ myconfig.properties
├─ fragments/
│ ├─ shared.cfg
│ ├─ shared.variables
Configuration file myconfig.cfg:
<myconfig>
<host>$host</host>
<port>$port</port>
#parse("../fragments/shared.cfg")
</myconfig>
Explanation:
When loading myconfig.cfg, the variables defined in myconfig.properties are
automatically loaded and will replace the $host and $port variables.
The myconfig.cfg file is also parsing a shared configuration
file: shared.cfg. That file will be parsed and inserted, with its variables
defined in shared.variables automatically loaded and resolved.
Other Velocity directives are supported (if-else statements, foreach loops, macros, etc.). Refer to the Velocity User Guide for complete syntax and template documentation.