While Norconex Importer works out of the box with its default settings, you will only unlock its full potential if you take the time to configure it properly using Java or XML.
Refer to the following for an XML-based configuration. Entries with a "class" attribute expect an implementation of your choice. The Importer API already offers several concrete implementations. Developers can also create their own by implementing the proper Java interfaces. Refer to the Importer JavaDoc and/or see further down which interfaces you can implement to provide custom functionality. Go to the Extend the Importer section for more details on adding your own implementations.
<importer>
  <tempDir></tempDir>
  <maxFileCacheSize></maxFileCacheSize>
  <maxFilePoolCacheSize></maxFilePoolCacheSize>
  <parseErrorsSaveDir></parseErrorsSaveDir>
  <preParseHandlers>
    <!-- These tags can be mixed, in the desired order of execution. -->
    <tagger class="..." />
    <transformer class="..." />
    <filter class="..." />
    <splitter class="..." />
  </preParseHandlers>
  <documentParserFactory class="..." />
  <postParseHandlers>
    <!-- These tags can be mixed, in the desired order of execution. -->
    <tagger class="..." />
    <transformer class="..." />
    <filter class="..." />
    <splitter class="..." />
  </postParseHandlers>
  <responseProcessors>
    <responseProcessor class="..." />
  </responseProcessors>
</importer>
The table below lists the interfaces you can implement, along with the implementations available out of the box.
In the configuration file, you have to use the fully qualified name, as defined in the Javadoc (you can use variables to shorten package names). Click on a class or interface name to go directly to its full documentation, with extra configuration options.
When a default implementation exists for a configuration option taking a class attribute, it is highlighted.
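The remark above about using variables to shorten package names can be sketched as follows. This is a minimal illustration, assuming the configuration file is processed as an Apache Velocity template before it is parsed as XML, so that a `#set` directive can define a variable. The variable name `tagger` is made up for this sketch; the two class names are the ones used in the "News" example on this page.

```xml
#set($tagger = "com.norconex.importer.handler.tagger.impl")

<importer>
  <postParseHandlers>
    <!-- The "tagger" variable (illustrative name) expands to the package set above. -->
    <tagger class="${tagger}.ConstantTagger">
      <constant name="doctype">News</constant>
    </tagger>
    <tagger class="${tagger}.SingleValueTagger">
      <singleValue field="title" action="keepFirst"/>
    </tagger>
  </postParseHandlers>
</importer>
```

After variable substitution, each class attribute should resolve to the same fully qualified name you would otherwise write out in full.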
Suppose you are building a service that offers content extracted from documents of various kinds. You have a special batch that you want your system to treat as "News" documents, and you want to add a metadata value to each of these documents to mark them as such. You also noticed that some of these documents are HTML files with two "title" meta tags, and you want to keep only the first one encountered to avoid possible issues. The following configuration accomplishes this:
<importer>
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
      <constant name="doctype">News</constant>
    </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.SingleValueTagger">
      <singleValue field="title" action="keepFirst"/>
    </tagger>
  </postParseHandlers>
</importer>
There is a lot more you can do to structure your configuration files the way you like. Refer to this additional documentation for more configuration options such as creating reusable configuration fragments and using variables to make your file easier to maintain and more portable across different environments.
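As a rough illustration of a reusable configuration fragment, the sketch below pulls shared handler definitions in from a separate file. It assumes the configuration is processed as an Apache Velocity template, in which case the standard `#parse` directive can include another file. The file name `shared-taggers.xml` is hypothetical, and how relative paths are resolved is an assumption to verify against the additional documentation mentioned above.

```xml
<importer>
  <postParseHandlers>
    <!-- Hypothetical fragment file holding tagger definitions shared across configurations. -->
    #parse("shared-taggers.xml")
  </postParseHandlers>
</importer>
```

In this sketch, `shared-taggers.xml` would contain only the `<tagger>` elements themselves, so the same set of handlers could be reused by several configuration files.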