Norconex Web Crawler

FAQ and HOWTOs

If you do not find answers to your questions here, please ask your question on GitHub and it may find its way here.

All Crawlers

What file formats are supported?

The parsing of downloaded files is performed by the Norconex Importer. The full list of supported file formats is available on its web site.

How to prevent fields from being added to a document

DeleteTagger and KeepOnlyTagger will help you produce just the fields you want. The latter is probably the one you want in most cases. The following shows how to eliminate all fields from a document, except for the document reference, keywords, and description fields:

<importer>
    <postParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>document.reference, keywords, description</fields> 
        </tagger> 
    </postParseHandlers>
</importer>

How to choose the right Crawl Database implementation

Norconex Crawlers need a database to store key reference information about each collected document (URL, path, etc.). Four implementations are offered out-of-the-box: MVStore, MapDB, MongoDB, and JDBC (Derby or H2). Prior to version 2.5.0 of both the HTTP and File System Crawlers, MapDB was the default implementation. Since version 2.5.0, MVStore is the default. Using the default implementation does not require explicit configuration. The following will help you decide which one is right for you (a sample configuration appears after the list):

  • MVStore: Fast key/value database. It should address the vast majority of cases. MVStore performs really well on a single server, using a mix of memory and the local file system. When in doubt, use this one. If you run into issues with MVStore, MapDB is a good alternative to try.

  • MapDB: Fast key/value database, similar to MVStore.

  • MongoDB: When dealing with many millions or billions of document references, the disk space required by MVStore or MapDB may grow too large for you. For average scenarios, MongoDB should not bring any performance gain, but it allows you to use a distributed cluster to store references on huge crawls.

  • JDBC: The slowest implementation by far. For relatively small crawls, Derby/H2 are excellent if you want to issue SQL queries against crawled URLs, for reporting or other purposes.
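
If you need to switch implementations explicitly, it is done in your crawler configuration. The following is a minimal sketch for MongoDB, assuming the HTTP Collector 2.x crawlDataStoreFactory element; the exact class name and child elements (host, port, dbname) are illustrative and should be verified against the Collector documentation for your version:

<crawler id="any-crawler">
    <crawlDataStoreFactory
        class="com.norconex.collector.http.data.store.impl.mongo.MongoCrawlDataStoreFactory">
        <!-- Connection details are assumptions; adjust to your MongoDB setup. -->
        <host>localhost</host>
        <port>27017</port>
        <dbname>mycrawler</dbname>
    </crawlDataStoreFactory>
    ...
</crawler>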

How to set up a good directory structure

There are many ways to go about this. The following example keeps generated files together for each collector you define, favors configuration re-use, and ensures portability across environments. Replace <ROOT_DIR> with the directory of your choice. It does not have to be within the directory where you have installed the Norconex Collector (but it can be if you like). Choosing a separate root directory for your configuration may be a good idea to emphasize that the same Collector can be shared by any configuration instance you create. In other words, you can have many concurrent Norconex Collector instances running, each having their own configuration, but all sharing the same binaries (no need to have multiple installs unless you are testing different versions). In this example, we store the Collector next to the other directories to demonstrate that.

<ROOT_DIR>/
    norconex-collector-xxxx-x.x.x/ <-- Collector installation
    shared_configs/                <-- contains fragments for all your configs.
    collectors/                    <-- one sub-directory per collector you have
        myCollectorA/              <-- Your first collector
            config/                <-- All configs specific to myCollectorA
            workdir/               <-- All myCollectorA crawler generated files
        myCollectorB/              <-- Your second collector
            config/                <-- All configs specific to myCollectorB
            workdir/               <-- All myCollectorB crawler generated files
        ...

To take advantage of this directory structure, you have to update your configuration file(s) accordingly. Let's assume we are working on "myCollectorA". The first thing you want to do to ensure portability is to abstract your environment-specific values in a variables file. In this case, we'll assume the path differs between environments. We'll store our path in a file named <ROOT_DIR>/collectors/myCollectorA/config/collectorA-config.variables, and it will contain the following:

workdir = <ROOT_DIR>/collectors/myCollectorA/workdir/

The workdir variable can be referenced in your Collector configuration with a dollar-sign prefix: $workdir or ${workdir}. The variables file will be automatically loaded if you store it in the same folder as your main config, with the same name (but a different extension): <ROOT_DIR>/collectors/myCollectorA/config/collectorA-config.xml. You can also specify an alternate variables file path and name by passing it as an extra argument when you launch the Collector. We will use our variable to store every generated file in our working directory (Web Crawler example):

<httpcollector id="my-Collector-A">
  <!-- Uncomment the following to hard-code the working directory instead:
  ##set($workdir = "<ROOT_DIR>/collectors/myCollectorA/workdir/")
    -->
  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>
 
  ...
 
  <!-- The following takes advantage of shared configuration files. -->
  <importer>
     #parse("../../../shared_configs/shared-imports.cfg")
  </importer>
 
  <crawler id="any-crawler">
    <workDir>$workdir/any-crawler</workDir>
    ...
  </crawler>
</httpcollector>
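
To launch the Collector with this configuration and explicitly point to the variables file, the command looks something like the following sketch (based on the standard launch script shipped with the Web Crawler; the script name and options may differ slightly between versions):

# From the Collector installation directory (Linux/Unix):
collector-http.sh -a start \
    -c <ROOT_DIR>/collectors/myCollectorA/config/collectorA-config.xml \
    -v <ROOT_DIR>/collectors/myCollectorA/config/collectorA-config.variables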

How to change log levels

Locate the log4j.properties file in the root of your Web Crawler installation directory. It is self-documented enough to change the log level for the most frequent scenarios (do not log OK URLs, log rejection details, etc.). You can also adjust the log level for other specific classes of a Collector, or even your own. Visit the Log4j web site for more information.
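
For example, a couple of typical adjustments in log4j.properties look like this (the specific logger entries below are illustrative; the file itself lists the loggers that ship with the Collector):

# More verbose logging for the whole HTTP Collector code base:
log4j.logger.com.norconex.collector.http=DEBUG

# Quiet down a chatty third-party library:
log4j.logger.org.apache.http=WARN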

One collector with several crawlers or a crawler with several starting sources?

The following is an edited version of a question answered on GitHub.

There are no fixed rules for this. It often comes down to whatever is easier for you to maintain. The following considerations can help you decide the best approach:

  • Ensure all starting references (e.g. URLs, paths) you put under the same crawler share the same settings or crawl requirements. For instance, two web sites may need different authentication; for one you may want to strip the header on every page, for the other the sidebar, etc. Your configuration settings may not always apply to every site and can cause trouble. In such cases, set them up in different crawlers/collectors.

  • What if one of your 10 sources under one crawler fails to get crawled properly because of a bad configuration or some other issue? You will then have to either restart the crawl with all 10 sources just to troubleshoot that one, or comment out the other ones while you troubleshoot. Neither option is ideal. You may want to separate your sources into different crawlers to avoid this, especially when in a heavy development stage.

  • The above scenario also applies if you have multiple crawlers defined. If only one needs restarting, you'll be forced to restart them all. Separating them in different collectors can sometimes work best.

  • Do you have 1000+ sources to crawl? It may be a big challenge to apply the above tips with so many; you do not want to create tons of different crawlers/collectors. In that case, a good mixed approach may be to put long-running or more complex sources in their own crawler or collector configuration, and group several smaller/simpler sources together.

  • If crawling many sources with several threads is becoming too taxing on your hardware resources, you may want to use multiple servers to spread the load. This is obviously a very good reason to have more than one collector.

  • Putting the above considerations aside, one good reason for having multiple crawlers under the same collector is if most of them share common values, with only a few exceptions. You can define most settings only once under <crawlerDefaults> and keep the specifics under each individual crawler, as shown in the sketch below. Keep in mind you can also share configuration snippets between different configuration files.
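
The following is a minimal sketch of that pattern in a Web Crawler configuration (the crawler IDs, URLs, and chosen settings are placeholders):

<httpcollector id="shared-settings-collector">
  <crawlerDefaults>
    <!-- Settings shared by all crawlers below -->
    <numThreads>4</numThreads>
    <maxDepth>5</maxDepth>
  </crawlerDefaults>
  <crawlers>
    <crawler id="siteA">
      <startURLs>
        <url>http://siteA.example.com/</url>
      </startURLs>
    </crawler>
    <crawler id="siteB">
      <startURLs>
        <url>http://siteB.example.com/</url>
      </startURLs>
      <!-- Overrides the default for this crawler only -->
      <maxDepth>2</maxDepth>
    </crawler>
  </crawlers>
</httpcollector>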

How to solve "Too many open files" errors

This is a common issue on Linux/Unix systems, where the number of file handles a process can have open at once is limited. Norconex Collectors use third-party libraries for parsing files. The parsing of certain file types, or of files with several embedded objects, can sometimes produce a large quantity of temporary files. This heavy reliance on temporary files is often an approach taken to avoid using too much memory. Luckily, operating systems can be configured to increase the "open file" limit (or make it unlimited). You can find several sets of instructions online for solving this problem.
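
For example, on most Linux distributions you can check and temporarily raise the limit for the current shell session as follows (persistent changes typically go through /etc/security/limits.conf or your service manager; exact steps vary by distribution):

# Show the current "open files" limit for this shell:
ulimit -n

# Raise it for this session before launching the Collector (65536 is just an example):
ulimit -n 65536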

Web Crawler

How to perform authentication on password-protected websites

The GenericHttpClientFactory class offers several authentication schemes: FORM, BASIC, DIGEST, NTLM, and a few experimental ones. Below are configuration examples covering the three most frequent:

FORM-Based Authentication:

<httpClientFactory>
    <authMethod>form</authMethod>
    <authUsername>smithj</authUsername>
    <authPassword>secret</authPassword>
 
    <!-- Name of the HTML fields (or HTTP parameters) used to authenticate -->
    <authUsernameField>username</authUsernameField>
    <authPasswordField>password</authPasswordField>
 
    <!-- URL of the login form page --> 
    <authURL>http://mysite.com/secure/login.html</authURL>
</httpClientFactory>

BASIC or DIGEST Authentication:

<httpClientFactory>
    <!-- For DIGEST, change "basic" to "digest" -->    
    <authMethod>basic</authMethod>
    <authUsername>...</authUsername>
    <authPassword>...</authPassword>
 
    <authHostname>mysite.com</authHostname>
    <authPort>80</authPort>
    <authRealm>My Site Realm</authRealm>
</httpClientFactory>

How to encrypt passwords

Since 2.4.0, it is possible to encrypt passwords used with GenericHttpClientFactory when authenticating to web sites or using a proxy. This can be useful when storing passwords in plain text is not acceptable.

To encrypt a password you have to invoke the EncryptionUtil class found in the Norconex Commons Lang library. First, open a command prompt and go to the lib directory found under your crawler installation. In that directory you should find a norconex-commons-lang-[version].jar file. Make sure the version is 1.9.0 or higher. From that directory, issue the following command to display the usage options:

java -cp norconex-commons-lang-[version].jar com.norconex.commons.lang.encrypt.EncryptionUtil

Once you have read the usage options, run the above command again with the appropriate arguments. For instance, the following will use an encryption key found in a file to encrypt the password:

<above_command> encrypt -f "/path/to/key.txt" "passwordToEncrypt"

Copy the command output and paste it in your configuration file. Using the above example to perform authentication on a web site, the relevant configuration part will look like this:

<httpClientFactory>
    ...
    <authUsername>myusername</authUsername>
    <authPassword>CDKamvEQCBiyfUGDdqyjuFdJlsTOPjNa</authPassword>
    <authPasswordKey>/path/to/key.txt</authPasswordKey>
    <authPasswordKeySource>file</authPasswordKeySource>
    ...
</httpClientFactory>

How to ignore robot rules or sitemaps

Set ignore="true" on any of the following within your crawler configuration:

<!-- To ignore robots.txt files -->
<robotsTxt ignore="true" />
 
<!-- To ignore in-page robot rules -->
<robotsMeta ignore="true" />
 
<!-- To ignore sitemap.xml files -->
<!-- Before 2.3.0: -->
<sitemap ignore="true" />
<!-- Since 2.3.0: -->
<sitemapResolverFactory ignore="true" />

How to follow HTML meta-equiv redirects without indexing the original page

Meta-equiv redirects are found in the <head> section of an HTML page and are extracted for processing by default. They look like this:

<META http-equiv="refresh" content="5;URL=http://blah.com/newpage.html">

If you want the URL to be followed, you can't filter out the page hosting the redirect tag until its URL has been extracted and its content parsed. This means any filtering actions should take place after document parsing has occurred. Document parsing is the responsibility of the Importer module. One way to do this is by adding a RegexMetadataFilter to your crawler configuration:

<importer>
    <postParseHandlers>
        <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" 
                onMatch="exclude" property="refresh">.*</filter>
    </postParseHandlers>
</importer>

If you have specified a prefix to the extracted metadata, make sure to change the property name accordingly.

How to limit crawling to specific sites

By default, Norconex Web Crawler will try to follow and crawl every link it detects in pages. Some of these links can lead to different web sites. Often you want to limit crawling to just the sites specified as start URLs. You can always use referenceFilters to only accept URLs matching your sites, as in the example below.
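
For instance, a reference filter restricting the crawl to a single site could look like this sketch (the URL pattern is a placeholder; adjust the regular expression to match your own start URLs):

<referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include">
        http://mysite\.com/.*
    </filter>
</referenceFilters>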

Since version 2.3.0, there is a simpler way. You can tell your crawler to "stay" on the same domain, protocol, and/or port as those defined in your start URLs. An example:

<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
    ...
</startURLs>

How to report broken links

Using URLStatusCrawlerEventListener, you can specify which HTTP status codes you would like to report on, in a file stored in a directory of your choice. Every URL producing one of the specified status codes will be stored in a tab-separated-value file along with its status code. For example, to report all "Forbidden" (403) and "Not Found" (404) errors:

<listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
    <statusCodes>403,404</statusCodes>
    <outputDir>/path/to/any/directory</outputDir>
</listener>