Norconex Filesystem Collector

FAQ and HOWTOs

If you do not find answers to your questions here, please ask your question on GitHub and it may find its way here.

All Collectors

What file formats are supported?

The parsing of downloaded files is performed by the Norconex Importer. The full list of supported file formats is available on its web site.

How to prevent fields from being added to a document

DeleteTagger and KeepOnlyTagger will help you produce just the fields you want. The latter is probably the one you want in most cases. The following shows how to eliminate all fields from a document, except for the document reference, keywords, and description fields:

<importer>
    <postParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>document.reference, keywords, description</fields> 
        </tagger> 
    </postParseHandlers>
</importer>
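
Where only a handful of fields need removing, DeleteTagger may be the simpler choice. The following is a minimal sketch, assuming DeleteTagger lives in the same package as KeepOnlyTagger and accepts the same comma-separated <fields> element. The field names shown are hypothetical, so check the Importer documentation for your version:

```xml
<importer>
    <postParseHandlers>
        <!-- Assumed class name and <fields> syntax, mirroring KeepOnlyTagger. -->
        <!-- "myUnwantedField1" and "myUnwantedField2" are hypothetical field names. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
            <fields>myUnwantedField1, myUnwantedField2</fields>
        </tagger>
    </postParseHandlers>
</importer>
```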

How to choose the right Crawl Database implementation

Norconex Collectors need a database to store key reference information about a collected document (URL, path, etc.). Four implementations are offered out-of-the-box: MVStore, MapDB, MongoDB, and JDBC (Derby or H2). Prior to version 2.5.0 of both the HTTP and Filesystem collectors, MapDB was the default implementation. Since version 2.5.0 of these collectors, MVStore is the default implementation. Using the default implementation does not require explicit configuration. The following will help you decide which one is right for you:

  • MVStore: Fast key/value database. It should address the vast majority of cases. MVStore performs really well on a single server, using a mix of memory and the local file system. When in doubt, use this one. If you run into issues with MVStore, MapDB is a good alternative to try.

  • MapDB: Fast key/value database like MVStore.

  • MongoDB: When dealing with several millions or billions of document references with MVStore or MapDB, the disk space required to store them may grow too large for you. For average scenarios, MongoDB should not bring any performance gain, but it allows you to use a distributed cluster to store references on huge crawls.

  • JDBC: The slowest implementation by far. For relatively small crawls, Derby/H2 are excellent if you want to issue SQL queries against crawled URLs, for reporting or other purposes.
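
To switch away from the default, a crawler configuration can name the implementation explicitly. The sketch below assumes the <crawlDataStoreFactory> element and the MapDB factory class name used in the 2.x Collector Core library. Both are assumptions here, so confirm them against the documentation for your version:

```xml
<crawler id="any-crawler">
    <!-- Assumed element and class name (2.x Collector Core); verify before use. -->
    <crawlDataStoreFactory
        class="com.norconex.collector.core.data.store.impl.mapdb.MapDBCrawlDataStoreFactory" />
</crawler>
```

Leaving the element out altogether keeps the default implementation, which is the recommended starting point.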

How to set up a good directory structure

There are many ways to go about this. The following example keeps the generated files of each collector together, favors configuration reuse, and ensures portability across environments. Replace <ROOT_DIR> with the directory of your choice. It does not have to be within the directory where you installed Norconex Collector (but it can be if you like). Choosing a separate root directory for your configuration emphasizes that the same Collector installation can be shared by any configuration instance you create. In other words, you can have many Norconex Collector instances running concurrently, each with its own configuration, but all sharing the same binaries (there is no need for multiple installs unless you are testing different versions). In this example, we store the Collector next to the other directories to demonstrate that.

<ROOT_DIR>/
    norconex-collector-xxxx-x.x.x/ <-- Collector installation
    shared_configs/                <-- contains fragments for all your configs.
    collectors/                    <-- one sub-directory per collector you have
        myCollectorA/              <-- Your first collector
            config/                <-- All configs specific to myCollectorA
            workdir/               <-- All myCollectorA crawler generated files
        myCollectorB/              <-- Your second collector
            config/                <-- All configs specific to myCollectorB
            workdir/               <-- All myCollectorB crawler generated files
        ...

To take advantage of this directory structure, you have to update your configuration file(s) accordingly. Let's assume we are working on "myCollectorA". The first thing to do to ensure portability is to abstract your environment-specific values into a variables file. In this case, we'll assume the working directory path is what differs between environments. We'll store our path in a file named <ROOT_DIR>/collectors/myCollectorA/config/collectorA-config.variables and it will contain the following:

workdir = <ROOT_DIR>/collectors/myCollectorA/workdir/

The workdir variable can be referenced in your Collector configuration with a dollar sign prefix, as $workdir or ${workdir}. The variables file is loaded automatically if you store it in the same folder as your main configuration file and give it the same name with a different extension. Here, the main configuration file is <ROOT_DIR>/collectors/myCollectorA/config/collectorA-config.xml. You can also specify an alternate variables file path and name by passing it as an extra argument when you launch the Collector. We will use our variable to store every generated file in our working directory (HTTP Collector example):

<httpcollector id="my-Collector-A">
  <!-- Uncomment the following to hard-code the working directory instead:
  ##set($workdir = "<ROOT_DIR>/collectors/myCollectorA/workdir/")
    -->
  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>
 
  ...
 
  <!-- The following takes advantage of sharing configuration files. -->
  <importer>
     #parse("../../../shared_configs/shared-imports.cfg")
  </importer>
 
  <crawler id="any-crawler">
    <workDir>$workdir/any-crawler</workDir>
    ...
  </crawler>
</httpcollector>
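
The shared fragment referenced by #parse above holds whatever belongs between the <importer> tags. As an illustration only, a shared_configs/shared-imports.cfg file could reuse the KeepOnlyTagger settings shown earlier in this FAQ. The contents are hypothetical; only the file name comes from the example above:

```xml
<!-- Hypothetical contents of <ROOT_DIR>/shared_configs/shared-imports.cfg. -->
<!-- No <importer> wrapper: the including configuration supplies it. -->
<postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
        <fields>document.reference, keywords, description</fields>
    </tagger>
</postParseHandlers>
```

Keeping the wrapper element in the main configuration lets each collector add its own importer settings around the shared part.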

How to change log levels

Locate the log4j.properties file in the root of your Collector installation directory. It is documented well enough to let you change the log level for the most frequent scenarios (do not log OK URLs, log rejection details, etc.). You can also adjust the log level for other specific Collector classes, or even your own. Visit the log4j site for more information.

One collector with several crawlers or a crawler with several starting sources?

The following is an edited version of a question answered on GitHub.

There are no fixed rules for this. It often comes down to whatever is easier for you to maintain. The following considerations can help you decide on the best approach:

  • Ensure all starting references (e.g., URLs, paths) you put under the same crawler share the same settings or crawl requirements. For instance, two web sites may need different authentication, or you may want to strip the header from every page of one and the sidebar from the other. Configuration settings that do not apply to every site can cause trouble. In such cases, put the sites in different crawlers/collectors.

  • What if 1 of your 10 sources under 1 crawler fails to get crawled properly because of a bad configuration or some other problem? You will then have to either restart the crawl with all 10 sources just to troubleshoot that one, or comment out the other ones while you troubleshoot. Neither option is ideal. You may want to separate your sources into different crawlers to avoid this, especially during heavy development.

  • The above scenario also applies if you have multiple crawlers defined in the same collector. If only one needs restarting, you'll be forced to restart them all. Separating them into different collectors can sometimes work best.

  • Do you have 1000+ sources to crawl? It may be a big challenge to apply the above tips with so many, since you do not want to create tons of different crawlers/collectors. In that case, a good mixed approach may be to put long-running or more complex sources in their own crawler or collector configuration, and group several smaller or simpler sources together.

  • If crawling many sources with several threads is becoming too taxing on your hardware resources, you may want to use multiple servers to spread the load. This is obviously a very good reason to have more than one collector.

  • Putting the above considerations aside, one good reason for having multiple crawlers under the same collector is that most share common values, with only a few exceptions. You can define most settings only once under <crawlerDefaults> and put the specifics under each individual crawler. Keep in mind you can also share configuration snippets between different configuration files.
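
The last consideration can be sketched as follows for an HTTP Collector. Only <crawlerDefaults> is named above; the other element names (<crawlers>, <startURLs>, <numThreads>, <maxDepth>) are assumed from the 2.x HTTP Collector configuration, and the crawler ids, URLs, and values are hypothetical:

```xml
<httpcollector id="my-Collector-A">
  <!-- Settings shared by every crawler below. -->
  <crawlerDefaults>
    <numThreads>2</numThreads>
    <maxDepth>5</maxDepth>
  </crawlerDefaults>

  <crawlers>
    <!-- Inherits both defaults. -->
    <crawler id="crawler-simple-site">
      <startURLs>
        <url>http://example.com/</url>
      </startURLs>
    </crawler>
    <!-- Overrides only maxDepth; numThreads still comes from the defaults. -->
    <crawler id="crawler-deep-site">
      <startURLs>
        <url>http://example.org/</url>
      </startURLs>
      <maxDepth>20</maxDepth>
    </crawler>
  </crawlers>
</httpcollector>
```

A source that needs many overrides is usually a sign it belongs in its own crawler or collector, as discussed in the earlier points.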

How to solve "Too many open files" errors

This is a common issue on Linux/Unix systems, where the number of file handles a process can have open at once is limited. Norconex Collectors use third-party libraries for parsing files. Parsing certain file types, or files with several embedded objects, can sometimes produce a large quantity of temporary files. This heavy reliance on temporary files is often an approach taken to avoid using too much memory. Luckily, operating systems can be configured to increase the "open file" limit (or make it unlimited). Instructions for doing so are widely available online.