All Classes and Interfaces

Class
Description
Base class for conditions dealing with the document content as text.
Base class for filters dealing with the body of text documents only.
Base class for taggers dealing with the body of text documents only.
Base class for transformers dealing with text documents only.
Base class for document filters.
Base class for splitters.
Base class for taggers.
Base class for transformers.
Base class for handlers that apply only to certain types of documents, restricting applicable documents to those with a metadata field value matching a regular expression.
Deprecated.
Since 3.0.0, use composition with OnMatch instead
Base class to facilitate creating conditions based on text content, loading text into StringBuilder for memory processing.
Base class to facilitate creating filters based on text content, loading text into StringBuilder for memory processing.
Base class to facilitate creating taggers based on text content, loading text into StringBuilder for memory processing.
Base class to facilitate creating transformers on text content, loading text into a StringBuilder for memory processing.
Base class wrapping Apache Tika parser for use by the importer.
 
 
A condition based on whether the document content (default) or any of the specified metadata fields are blank or nonexistent.
Buffer related utility methods.
Changes the character case of matching fields and values according to one of the following methods:
Converts one or more field values (if needed) from a source character encoding (charset) to a target one.
Transforms a document content (if needed) from a source character encoding (charset) to a target one.
Character set utility methods.
Commonly used TextMatcher instances.
Commonly encountered restrictions that can be applied to Properties instances.
Define and add constant values to documents.
Deprecated.
Master class to detect all content types.
Copies metadata fields.
Counts the number of matches of a given string (or string pattern) and stores the resulting value in the specified "toField".
Deprecated.
Splits files with comma-separated values (or values separated by any other character, such as a tab) into one document per line.
Adds the current computer UTC date to the specified field.
A condition based on the date value(s) of matching metadata fields given the supplied date format.
 
 
 
 
 
Formats a date from any given format to a format of choice, as per the formatting options found on SimpleDateFormat with the exception of the string "EPOCH" which represents the difference, measured in milliseconds, between the date and midnight, January 1, 1970.
Accepts or rejects a document based on whether field values correspond to a date matching supplied conditions and format.
 
 
 
 
 
 
A utility tagger to help with troubleshooting of document importing.
Delete the metadata fields provided.
A document being imported.
Important information about a document that has specific meaning and purpose for processing by the Importer and needs to be referenced in a constant way.
Constants for common metadata field names typically associated with a document and often set on Doc.getMetadata().
Adds the document length (i.e., number of bytes) to the specified field.
Exception thrown upon encountering a non-recoverable issue parsing a document.
A condition using a Document Object Model (DOM) representation of an HTML, XHTML, or XML document content to match an element, attribute or value.
Deprecated.
Since 3.0.0, use DOMFilter.
Enables deletion of one or more elements matching a given selector from a document content.
Uses a Document Object Model (DOM) representation of an HTML, XHTML, or XML document content to perform filtering based on matching an element/attribute or element/attribute value.
Preserves only one or more elements matching a given selector from a document content.
DOM Extraction Details
Splits HTML, XHTML, or XML document on elements matching a given selector.
Extract the value of one or more elements or attributes into a target field, or delete matching elements.
DOM Extraction Details
Utility methods related to JSoup/DOM manipulation.
Configuration settings affecting how embedded documents are handled by parsers.
Accepts or rejects a document based on whether its content (default) or any of the specified metadata fields are empty or not.
Deprecated.
Since 3.0.0, use EmptyFilter.
Class executing an external application to extract data from and/or manipulate a document.
Parses and extracts text from a file using an external application.
Extracts metadata from a document using an external application.
Transforms a document using an external application.
Parser using auto-detection of document content-type to figure out which specific parser to invoke to best parse a document.
A utility tagger that reports in a CSV file the fields discovered in a crawl session, captured at the point of your choice in the importing process.
Forces a metadata field to be single-value.
Utility methods related to formatting.
Generic document parser factory.
Grobid REST service configuration.
Consumer wrapping an IImporterHandler instance for use in an XMLFlow.
 
 
Lighter version of Doc which leaves content out to let each handler dictate how content should be referenced.
Predicate wrapping an IImporterCondition instance for use in an XMLFlow.
Given a separator, split a field string into multiple segments representing each node of a hierarchical branch.
 
Filters documents.
Implementations are responsible for parsing a document to extract its text and metadata, as well as any embedded documents (when applicable).
Factory providing document parsers for documents.
Responsible for splitting a single document into several ones.
Tags a document with extra metadata information, or manipulates existing metadata information.
Transformers manipulate and modify a document's metadata or content.
Indicates that a parser can be initialized with generic parser configuration settings and will apply those settings as best it can.
A condition usually used in XML flow creation when configuring importer handlers.
Identifies a class as being an import handler.
Processes an importer response to modify it or perform other actions as required before it is returned.
Transforms an image using common image operations.
Principal class responsible for importing documents.
Importer configuration.
An Importer event.
 
Exception thrown when an issue prevented the proper importation of a file.
Exception thrown by several handler classes upon encountering issues.
Command line launcher of the Importer application.
An Importer request, unique for each document to be imported.
 
RuntimeException thrown when an issue prevented the proper importation of a file.
 
 
Tells the collector that a filter is of "OnMatch" type.
Keep only the metadata fields provided, delete all other ones.
Detects a document language based on Apache Tika language detection capability.
Merge multiple metadata fields into a single one.
 
Get rid of the content stream and optionally store it as text into a metadata field instead.
A condition based on the numeric value(s) of matching metadata fields, supporting decimals.
 
Accepts or rejects a document based on the numeric value(s) of matching metadata fields, supporting decimals.
 
 
OCR configuration details.
Constants indicating the action to perform upon matching a condition.
Configuration settings influencing how documents are parsed by various parsers.
Acts as a flag indicating whether a document has been parsed in a given process flow.
Splits PDF pages so that each page is treated as an individual document.
Reduces specified consecutive characters or strings to only one instance (document content only).
A condition based on a text pattern matching a document reference (e.g.
Accepts or rejects a document based on its reference (e.g.
Deprecated.
Since 3.0.0, use TextFilter instead.
Deprecated.
Since 3.0.0, use RegexFieldValueExtractor from Norconex Commons Lang
Deprecated.
Since 3.0.0, use TextFilter instead.
Deprecated.
Since 3.0.0, use ReferenceFilter instead.
Extracts field names and their values with regular expression.
Deprecated.
Since 3.0.0, use RegexFieldValueExtractor from Norconex Commons Lang
Rejects a document.
Rename metadata fields to different names.
 
Replaces an existing metadata value with another one.
 
Replaces every occurrence of the given replacements (document content only).
 
A condition formulated using a scripting language.
Filter incoming documents using a scripting language.
Runs scripts written in a programming language supported by the provided script engine.
Tag incoming documents using a scripting language.
Transform incoming documents using a scripting language.
Splits an existing metadata value into multiple values based on a given value separator (the separator gets discarded).
 
Strips any content found after the first match of a given pattern.
Strips any content found before the first match of a given pattern.
Strips any content found between matching start and end strings.
 
Keeps the substring of the content between the given begin and end character indexes.
Extracts values found between matching start and end strings and adds them to a document metadata field.
 
A condition based on a text pattern matching a document content (default), or matching specific field(s).
Filters a document based on a text pattern in a document content (default), or matching fields specified.
Deprecated.
Since 3.0.0, use RegexTagger.
Analyzes the content of the supplied document and adds statistical information about its content or field as metadata fields.
Attempts to generate a title from the document content (default) or a specified metadata field.
Translates documents using one of the supported translation APIs.
Truncates the fromField value(s) and optionally replaces the truncated portion with a hash value to help ensure uniqueness (not 100% guaranteed to be collision-free).
Extracts unique URLs matching specific patterns in plain text content and stores them in a given field.
Generates a random universally unique identifier (UUID) and stores it in the specified field.
Parser for PureEdge Extensible Forms Description Language (XFDL).
Splits XML document on a specific element.
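
One entry in this index describes splitting a field string, given a separator, into multiple segments representing each node of a hierarchical branch. The sketch below shows one plausible reading of that behavior using only the Java standard library. It does not use the Importer API: the class name HierarchySketch and the method expand are hypothetical, and the choice to emit the cumulative path for each node (rather than the bare segment) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Importer API.
class HierarchySketch {

    // Returns one entry per node of the branch. Each entry holds the path
    // from the root down to that node (an assumed interpretation).
    static List<String> expand(String value, String separator) {
        List<String> nodes = new ArrayList<>();
        StringBuilder path = new StringBuilder();
        boolean leadingSeparator = value.startsWith(separator);
        // Quote the separator so it is treated literally, not as a regex.
        for (String segment : value.split(Pattern.quote(separator))) {
            if (segment.isEmpty()) {
                continue; // skip empty segments, e.g. from a leading separator
            }
            // Keep a leading separator only if the original value had one.
            if (path.length() > 0 || leadingSeparator) {
                path.append(separator);
            }
            path.append(segment);
            nodes.add(path.toString());
        }
        return nodes;
    }

    public static void main(String[] args) {
        // prints [/vegetable, /vegetable/potato, /vegetable/potato/sweet]
        System.out.println(expand("/vegetable/potato/sweet", "/"));
    }
}
```

Emitting cumulative paths is the form that suits faceted counting, since a document filed under a deep node is then also counted under each of its ancestors.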