Class RegexContentFilter

  • All Implemented Interfaces:
    IXMLConfigurable, IDocumentFilter, IOnMatchFilter, IImporterHandler

    @Deprecated
    public class RegexContentFilter
    extends AbstractStringFilter
    Deprecated.
    Since 3.0.0, use TextFilter instead.

    Filters a document based on a regular expression matched against its content. Large documents may be read and matched in chunks (see maxReadSize), so a pattern whose match would span a chunk boundary may not give the expected result. Consider using AbstractCharStreamFilter if this is a concern. Refer to AbstractDocumentFilter for the inclusion/exclusion logic.
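
    The chunking caveat can be illustrated with plain java.util.regex. This is a minimal sketch, not this class's implementation: the ChunkedMatchSketch class, its chunks and anyChunkMatches helpers, and the fixed-size splitting are assumptions made for illustration, as is the use of Matcher.matches() on each chunk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ChunkedMatchSketch {

    // Hypothetical helper: splits text into fixed-size chunks, roughly the way
    // a reader limited by a maximum read size might.
    static List<String> chunks(String text, int maxReadSize) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < text.length(); i += maxReadSize) {
            parts.add(text.substring(i, Math.min(text.length(), i + maxReadSize)));
        }
        return parts;
    }

    // True if at least one chunk, taken on its own, fully matches the pattern.
    static boolean anyChunkMatches(String text, int maxReadSize, Pattern pattern) {
        for (String chunk : chunks(text, maxReadSize)) {
            if (pattern.matcher(chunk).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile(
                ".*apple.*pie.*", Pattern.MULTILINE | Pattern.DOTALL);
        String content = "An apple a day. Later we bake a pie.";

        // Whole content in a single read: both words are visible to the pattern.
        System.out.println(anyChunkMatches(content, 1000, pattern)); // true

        // 16-character chunks: "apple" and "pie" land in different chunks.
        System.out.println(anyChunkMatches(content, 16, pattern));   // false
    }
}
```

    With the smaller read size no single chunk contains both words, so the same pattern that matches the whole content no longer matches.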

    Since 2.2.0, the following regular expression flags are always active: Pattern.MULTILINE and Pattern.DOTALL.
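
    The effect of these two flags can be seen with java.util.regex alone. The sketch below is illustrative only: the RegexFlagsSketch class, its helper methods and the sample content are hypothetical and are not part of this library.

```java
import java.util.regex.Pattern;

public class RegexFlagsSketch {

    // Sample multi-line content (illustrative only).
    static final String CONTENT = "first line\nan apple here\nlast line";

    // True if the regex matches the entire content.
    static boolean matchesWhole(String regex, int flags) {
        return Pattern.compile(regex, flags).matcher(CONTENT).matches();
    }

    // True if the regex matches somewhere in the content.
    static boolean findsAnywhere(String regex, int flags) {
        return Pattern.compile(regex, flags).matcher(CONTENT).find();
    }

    public static void main(String[] args) {
        int always = Pattern.MULTILINE | Pattern.DOTALL;

        // Without DOTALL, "." stops at line terminators, so a full match fails.
        System.out.println(matchesWhole(".*apple.*", 0));       // false
        System.out.println(matchesWhole(".*apple.*", always));  // true

        // Without MULTILINE, "^" only anchors at the start of the input.
        System.out.println(findsAnywhere("^an apple", 0));      // false
        System.out.println(findsAnywhere("^an apple", always)); // true
    }
}
```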

    XML configuration usage:

      <handler class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
              onMatch="[include|exclude]"
              caseSensitive="[false|true]"
              sourceCharset="(character encoding)"
              maxReadSize="(max characters to read at once)" >
    
          <restrictTo caseSensitive="[false|true]"
                  field="(name of header/metadata field to match)">
              (regular expression of value to match)
          </restrictTo>
          <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
    
          <regex>(regular expression of value to match)</regex>
      </handler>
     

    Usage example:

    This example accepts only documents containing the word "apple".

      <handler class="RegexContentFilter" onMatch="include">
          <regex>.*apple.*</regex>
      </handler>
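
    As a rough illustration of how onMatch and caseSensitive might interact for a single filter, here is a sketch in plain Java. It is an assumption-laden model, not the library's code: the OnMatchSketch class, its OnMatch enum and the accept method are hypothetical, the use of Matcher.matches() is assumed, and the full inclusion/exclusion logic described in AbstractDocumentFilter (for instance, how several filters combine) is not modeled.

```java
import java.util.regex.Pattern;

public class OnMatchSketch {

    // Hypothetical stand-in for the onMatch attribute values.
    enum OnMatch { INCLUDE, EXCLUDE }

    // Returns true when a document with this content would be kept,
    // under the assumptions stated above.
    static boolean accept(String content, String regex,
            boolean caseSensitive, OnMatch onMatch) {
        // Flags documented as always active since 2.2.0.
        int flags = Pattern.MULTILINE | Pattern.DOTALL;
        if (!caseSensitive) {
            flags |= Pattern.CASE_INSENSITIVE;
        }
        boolean matched = Pattern.compile(regex, flags).matcher(content).matches();
        // "include" keeps matching documents; "exclude" keeps the others.
        return onMatch == OnMatch.INCLUDE ? matched : !matched;
    }

    public static void main(String[] args) {
        String regex = ".*apple.*";
        System.out.println(accept("I ate an Apple.", regex, false, OnMatch.INCLUDE)); // true
        System.out.println(accept("I ate an Apple.", regex, true, OnMatch.INCLUDE));  // false
        System.out.println(accept("A pear tart.", regex, false, OnMatch.EXCLUDE));    // true
    }
}
```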
     
    Since:
    2.0.0
    Author:
    Pascal Essiembre