Modifier and Type | Method and Description |
---|---|
IImporterHandler[] | ImporterConfig.getPostParseHandlers() |
IImporterHandler[] | ImporterConfig.getPreParseHandlers() |
Modifier and Type | Method and Description |
---|---|
void | ImporterConfig.setPostParseHandlers(IImporterHandler... handlers) |
void | ImporterConfig.setPreParseHandlers(IImporterHandler... handlers) |
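
The setters above take a varargs list of handlers, so several handlers can be registered in one call. The sketch below illustrates that calling pattern only. To stay self-contained it declares simplified stand-ins for `IImporterHandler` and `ImporterConfig` that mirror just the four signatures listed above. The two handler classes (`TrimHandler`, `LowercaseHandler`) are hypothetical, and the assumption that handlers are kept in the order given is not confirmed by this page.

```java
// Stand-in for the library interface, declared only so the sketch compiles alone.
interface IImporterHandler {}

// Hypothetical handlers, for illustration only.
class TrimHandler implements IImporterHandler {}
class LowercaseHandler implements IImporterHandler {}

// Simplified stand-in exposing only the four accessors listed above.
class ImporterConfig {
    private IImporterHandler[] preParseHandlers = new IImporterHandler[0];
    private IImporterHandler[] postParseHandlers = new IImporterHandler[0];

    public void setPreParseHandlers(IImporterHandler... handlers) {
        this.preParseHandlers = handlers.clone();
    }

    public IImporterHandler[] getPreParseHandlers() {
        return preParseHandlers.clone();
    }

    public void setPostParseHandlers(IImporterHandler... handlers) {
        this.postParseHandlers = handlers.clone();
    }

    public IImporterHandler[] getPostParseHandlers() {
        return postParseHandlers.clone();
    }
}

class HandlerRegistrationSketch {
    public static void main(String[] args) {
        ImporterConfig config = new ImporterConfig();
        // Varargs: pass any number of handlers in a single call.
        config.setPreParseHandlers(new TrimHandler(), new LowercaseHandler());
        config.setPostParseHandlers(new LowercaseHandler());
        System.out.println(config.getPreParseHandlers().length);  // prints 2
        System.out.println(config.getPostParseHandlers().length); // prints 1
    }
}
```

With the real library, the same calls would be made on an actual `ImporterConfig` instance using handler classes such as the taggers and transformers listed on this page.
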
Modifier and Type | Interface and Description |
---|---|
interface | IDocumentFilter: Filters documents. |
Modifier and Type | Class and Description |
---|---|
class | AbstractCharStreamFilter: Base class for filters dealing with the body of text documents only. |
class | AbstractDocumentFilter: Base class for document filters. |
class | AbstractStringFilter: Base class to facilitate creating filters based on text content, loading text into a StringBuilder for in-memory processing. |
Modifier and Type | Class and Description |
---|---|
class | DateMetadataFilter: Accepts or rejects a document based on the date value(s) of a metadata field, stored in a specified format. |
class | DOMContentFilter: Uses a Document Object Model (DOM) representation of HTML, XHTML, or XML document content to perform filtering based on matching an element/attribute or element/attribute value. |
class | EmptyMetadataFilter: Accepts or rejects a document based on whether any of the specified metadata fields are empty or not. |
class | NumericMetadataFilter: Accepts or rejects a document based on the numeric value(s) of a metadata field, supporting decimals. |
class | RegexContentFilter: Filters a document based on a pattern match in its content. |
class | RegexMetadataFilter: Accepts or rejects a document based on its field values using a regular expression. |
class | RegexReferenceFilter: Accepts or rejects a document based on its reference (e.g. |
class | ScriptFilter: Filters incoming documents using a scripting language. |
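
The filters above share an accept-or-reject decision. As a minimal sketch of that decision for a regex-on-reference filter, the class below matches a document reference against a pattern and either includes or excludes the document on match. `ReferenceFilterSketch`, its `OnMatch` enum, and the `accept` method are hypothetical names chosen for illustration; they are not the `RegexReferenceFilter` API.

```java
import java.util.regex.Pattern;

class ReferenceFilterSketch {

    // Hypothetical switch: what to do with a document whose reference matches.
    enum OnMatch { INCLUDE, EXCLUDE }

    // Returns true when the document should be kept.
    static boolean accept(String reference, Pattern pattern, OnMatch onMatch) {
        boolean matches = pattern.matcher(reference).matches();
        return onMatch == OnMatch.INCLUDE ? matches : !matches;
    }

    public static void main(String[] args) {
        Pattern pdfOnly = Pattern.compile(".*\\.pdf$");
        System.out.println(accept("http://example.com/report.pdf", pdfOnly, OnMatch.INCLUDE)); // prints true
        System.out.println(accept("http://example.com/index.html", pdfOnly, OnMatch.INCLUDE)); // prints false
        System.out.println(accept("http://example.com/report.pdf", pdfOnly, OnMatch.EXCLUDE)); // prints false
    }
}
```
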
Modifier and Type | Interface and Description |
---|---|
interface | IDocumentSplitter: Responsible for splitting a single document into several documents. |
Modifier and Type | Class and Description |
---|---|
class | AbstractDocumentSplitter: Base class for splitters. |
Modifier and Type | Class and Description |
---|---|
class | CsvSplitter: Splits files with comma-separated values (or any other separator character, such as tab) into one document per line. |
class | DOMSplitter: Splits an HTML, XHTML, or XML document on a specific element. |
class | PDFPageSplitter: Splits PDF pages so that each page is treated as an individual document. |
class | TranslatorSplitter: Translates documents using one of the supported translation APIs. |
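
To make the "one document per line" idea concrete, here is a deliberately naive sketch of splitting separator-delimited text into one field map per line. It is not the `CsvSplitter` implementation: the class and method names are hypothetical, treating the first line as a header is an assumption, and quoted or escaped values are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class CsvSplitSketch {

    // Returns one field map per data line, keyed by the header names.
    static List<Map<String, String>> split(String csv, char separator) {
        List<Map<String, String>> documents = new ArrayList<>();
        String[] lines = csv.split("\\R"); // any line-break sequence
        if (lines.length == 0) {
            return documents;
        }
        String sep = Pattern.quote(String.valueOf(separator));
        String[] header = lines[0].split(sep, -1);
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = lines[i].split(sep, -1);
            Map<String, String> document = new LinkedHashMap<>();
            for (int c = 0; c < header.length && c < cells.length; c++) {
                document.put(header[c], cells[c]);
            }
            documents.add(document);
        }
        return documents;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = split("id\tname\n1\tAnn\n2\tBob", '\t');
        System.out.println(docs.size());             // prints 2
        System.out.println(docs.get(1).get("name")); // prints Bob
    }
}
```
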
Modifier and Type | Interface and Description |
---|---|
interface | IDocumentTagger: Tags a document with extra metadata information, or manipulates existing metadata information. |
Modifier and Type | Class and Description |
---|---|
class | AbstractCharStreamTagger: Base class for taggers dealing with the body of text documents only. |
class | AbstractDocumentTagger: Base class for taggers. |
class | AbstractStringTagger: Base class to facilitate creating taggers based on text content, loading text into a StringBuilder for in-memory processing. |
Modifier and Type | Class and Description |
---|---|
class | CharacterCaseTagger: Changes the character case of field values according to one of the following methods: |
class | CharsetTagger: Converts one or more field values (if needed) from a source character encoding (charset) to a target one. |
class | ConstantTagger: Defines and adds constant values to documents. |
class | CopyTagger: Copies metadata fields. |
class | CountMatchesTagger: Counts the number of matches of a given string (or string pattern) and stores the resulting value in the specified "toField". |
class | CurrentDateTagger: Adds the current computer UTC date to the specified field. |
class | DateFormatTagger: Formats a date from any given format to a format of choice, as per the formatting options found on SimpleDateFormat, with the exception of the string "EPOCH", which represents the difference, measured in milliseconds, between the date and midnight, January 1, 1970. |
class | DebugTagger: A utility tagger to help with troubleshooting of document importing. |
class | DeleteTagger: Deletes the metadata fields provided. |
class | DocumentLengthTagger: Adds the document length (i.e., number of bytes) to the specified field. |
class | DOMTagger: Extracts the value of one or more elements or attributes into a target field, from an HTML, XHTML, or XML document. |
class | ExternalTagger: Extracts metadata from a document using an external application. |
class | FieldReportTagger: A utility tagger that reports in a CSV file the fields discovered in a crawl session, captured at the point of your choice in the importing process. |
class | ForceSingleValueTagger: Forces a metadata field to be single-value. |
class | HierarchyTagger: Given a separator, splits a field string into multiple segments representing each node of a hierarchical branch. |
class | KeepOnlyTagger: Keeps only the metadata fields provided and deletes all others. |
class | LanguageTagger: Detects a document's language using Tika's language detection capability. |
class | MergeTagger: Merges multiple metadata fields into a single one. |
class | RenameTagger: Renames metadata fields. |
class | ReplaceTagger: Replaces an existing metadata value with another one. |
class | ScriptTagger: Tags incoming documents using a scripting language. |
class | SplitTagger: Splits an existing metadata value into multiple values based on a given value separator. |
class | TextBetweenTagger: Extracts and adds values found between matching start and end strings to a document metadata field. |
class | TextPatternTagger: Extracts and adds all text values matching the provided regular expression into a field provided explicitly, or also matching a regular expression. |
class | TextStatisticsTagger: Analyzes the content of the supplied document and adds statistical information about its content or field as metadata fields. |
class | TitleGeneratorTagger: Attempts to generate a title from the document content (default) or a specified metadata field. |
class | TruncateTagger: Truncates the fromField value(s) and optionally replaces the truncated portion with a hash value to help ensure uniqueness (not 100% guaranteed to be collision-free). |
class | UUIDTagger: Generates a random universally unique identifier (UUID) and stores it in the specified field. |
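
The `DateFormatTagger` description above mentions `SimpleDateFormat` patterns plus a special "EPOCH" value meaning milliseconds since midnight, January 1, 1970. The sketch below shows one way such a conversion could work using only the JDK. `DateFormatSketch` and `reformat` are hypothetical names, and fixing the time zone to UTC and the locale to `Locale.ROOT` are assumptions of this example, not documented behavior of the tagger.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DateFormatSketch {

    // Converts a date string between two formats; "EPOCH" means epoch milliseconds.
    static String reformat(String value, String fromFormat, String toFormat) {
        Date date;
        if ("EPOCH".equals(fromFormat)) {
            date = new Date(Long.parseLong(value));
        } else {
            try {
                date = utcFormat(fromFormat).parse(value);
            } catch (ParseException e) {
                throw new IllegalArgumentException("Cannot parse date: " + value, e);
            }
        }
        if ("EPOCH".equals(toFormat)) {
            return Long.toString(date.getTime());
        }
        return utcFormat(toFormat).format(date);
    }

    // Assumption: all dates are interpreted and rendered in UTC.
    private static SimpleDateFormat utcFormat(String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false);
        return format;
    }

    public static void main(String[] args) {
        System.out.println(reformat("1970-01-02", "yyyy-MM-dd", "EPOCH"));             // prints 86400000
        System.out.println(reformat("86400000", "EPOCH", "yyyy-MM-dd'T'HH:mm:ss"));    // prints 1970-01-02T00:00:00
    }
}
```
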
Modifier and Type | Interface and Description |
---|---|
interface | IDocumentTransformer: Transformers manipulate and modify a document's metadata or content. |
Modifier and Type | Class and Description |
---|---|
class | AbstractCharStreamTransformer: Base class for transformers dealing with text documents only. |
class | AbstractDocumentTransformer: Base class for transformers. |
class | AbstractStringTransformer: Base class to facilitate creating transformers on text content, loading text into a StringBuilder for in-memory processing. |
Modifier and Type | Class and Description |
---|---|
class | CharsetTransformer: Transforms a document's content (if needed) from a source character encoding (charset) to a target one. |
class | ExternalTransformer: Transforms a document's content using an external application. |
class | NoContentTransformer: Gets rid of the content stream and optionally stores it as text in a metadata field instead. |
class | ReduceConsecutivesTransformer: Reduces specified consecutive characters or strings to only one instance (document content only). |
class | ReplaceTransformer: Replaces every occurrence of the given replacements (document content only). |
class | ScriptTransformer: Transforms incoming documents using a scripting language. |
class | StripAfterTransformer: Strips any content found after the first match of the given pattern. |
class | StripBeforeTransformer: Strips any content found before the first match of the given pattern. |
class | StripBetweenTransformer: Strips any content found between matching start and end strings. |
class | SubstringTransformer: Keeps a substring of the content matching begin and end character indexes. |
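
As an illustration of the "reduce consecutive characters or strings to only one instance" behavior described for `ReduceConsecutivesTransformer`, the following sketch collapses repeated runs of a token with a regular expression. It uses only the JDK; the class and method names are hypothetical, and the actual transformer may work differently (for example, regarding case sensitivity).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReduceConsecutivesSketch {

    // Collapses two or more back-to-back occurrences of token into a single one.
    static String reduce(String text, String token) {
        if (token.isEmpty()) {
            throw new IllegalArgumentException("token must not be empty");
        }
        Pattern repeated = Pattern.compile("(?:" + Pattern.quote(token) + "){2,}");
        return repeated.matcher(text).replaceAll(Matcher.quoteReplacement(token));
    }

    public static void main(String[] args) {
        System.out.println(reduce("a    b", " "));   // prints a b
        System.out.println(reduce("hahaha!", "ha")); // prints ha!
    }
}
```

`Pattern.quote` and `Matcher.quoteReplacement` keep tokens such as `.` or `$` from being interpreted as regex syntax.
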
Copyright © 2009–2021 Norconex Inc. All rights reserved.