public class TitleGeneratorTagger extends AbstractStringTagger implements IXMLConfigurable
Attempts to generate a title from the document content (default) or a specified metadata field. It does not use the document format to weigh some terms more heavily than others. For instance, in HTML documents it does not treat text found in <H1> tags as more important than other text.
If isDetectHeading() returns true, this handler checks whether the content starts with a stand-alone, single-sentence line (which could be the actual title). That is, a line of text with only one sentence in it, followed by one or more newline characters. To help eliminate cases where such a sentence is inappropriate (e.g. "Page 1" text and the like), you can specify the minimum and maximum number of characters that first line should have with setDetectHeadingMinLength(int) and setDetectHeadingMaxLength(int).
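The heading check described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the class name HeadingDetectionSketch, the sentence-boundary rule, and the 10/150 length bounds used in main are assumptions chosen for illustration only.

```java
import java.util.Optional;

// Hypothetical stand-in, not part of the Norconex API.
final class HeadingDetectionSketch {

    /**
     * Returns the first line of the text if it is followed by a line break,
     * holds a single sentence, and has a length within [minLength, maxLength].
     */
    static Optional<String> detectHeading(String text, int minLength, int maxLength) {
        int newline = text.indexOf('\n');
        if (newline < 0) {
            return Optional.empty(); // the first line is not followed by a line break
        }
        String firstLine = text.substring(0, newline).trim();
        if (firstLine.isEmpty()
                || firstLine.length() < minLength
                || firstLine.length() > maxLength) {
            return Optional.empty(); // too short (e.g. "Page 1") or too long
        }
        // Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
        // More than one sentence on the line disqualifies it as a heading.
        if (firstLine.split("(?<=[.!?])\\s+").length > 1) {
            return Optional.empty();
        }
        return Optional.of(firstLine);
    }

    public static void main(String[] args) {
        String doc = "Annual Report on Water Quality\n\nThis report covers the year.";
        System.out.println(detectHeading(doc, 10, 150).orElse("(no heading)"));
        System.out.println(detectHeading("Page 1\nBody text.", 10, 150).orElse("(no heading)"));
    }
}
```

With the real class, the equivalent behavior is enabled through setDetectHeading(boolean) and bounded with setDetectHeadingMinLength(int) and setDetectHeadingMaxLength(int).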
Unless a target field name is provided, the title is stored in the default field, document.generatedTitle.
Unless setOverwrite(boolean) is set to true, no title will be generated if one already exists in the target field.
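The target-field and overwrite rules can be sketched as follows. This is an illustration under stated assumptions, not the library's code: a plain Map stands in for ImporterMetadata, and the class and method names (OverwriteSketch, storeTitle) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, not part of the Norconex API.
final class OverwriteSketch {

    // Default target field name, as stated in the class description.
    static final String DEFAULT_TO_FIELD = "document.generatedTitle";

    /**
     * Stores the title in the target field unless a non-blank value already
     * exists there and overwrite is false. Returns true if the title was stored.
     */
    static boolean storeTitle(
            Map<String, String> metadata, String toField, String title, boolean overwrite) {
        String target = (toField == null || toField.trim().isEmpty())
                ? DEFAULT_TO_FIELD : toField;
        String existing = metadata.get(target);
        if (!overwrite && existing != null && !existing.trim().isEmpty()) {
            return false; // keep the existing title
        }
        metadata.put(target, title);
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        storeTitle(metadata, null, "First title", false);
        storeTitle(metadata, null, "Second title", false); // ignored: a title exists
        System.out.println(metadata);
    }
}
```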
If it cannot generate a title, it falls back to retrieving the first sentence from the text.
The generated title length is limited to 150 characters by default. You can change that limit with setTitleMaxLength(int). Text longer than the limit is truncated, and three dots in square brackets ([...]) are appended. To remove the limit, use -1 (or the constant UNLIMITED_TITLE_LENGTH).
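The truncation rule can be sketched as below. This is a hedged illustration, not the library's code: the class name TitleTruncationSketch is hypothetical, and the choices to append the marker after cutting (so the result can exceed the limit by the marker's length) and to separate it with a space are assumptions the description above does not settle.

```java
// Hypothetical stand-in, not part of the Norconex API.
final class TitleTruncationSketch {

    // Mirrors the documented constant: -1 removes the limit.
    static final int UNLIMITED_TITLE_LENGTH = -1;

    /**
     * Cuts the title at maxLength characters and appends " [...]".
     * Any negative maxLength is treated as "no limit" in this sketch.
     */
    static String truncate(String title, int maxLength) {
        if (maxLength < 0 || title.length() <= maxLength) {
            return title;
        }
        return title.substring(0, maxLength).trim() + " [...]";
    }

    public static void main(String[] args) {
        System.out.println(truncate("A fairly long generated title", 8));
        System.out.println(truncate("A fairly long generated title", UNLIMITED_TITLE_LENGTH));
    }
}
```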
This class should be used as a post-parsing handler only (or otherwise on unformatted text).
Since 2.2.0, the title detection algorithm has been much simplified to eliminate extra dependencies that were otherwise not required. It uses a generic statistics-based approach to weight each sentence up to a certain amount, and simply returns the sentence with the most weight, provided a minimum threshold has been met. You are strongly encouraged to use a more sophisticated summarization engine if you want more accurate titles.
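To make the idea of statistics-based sentence weighting concrete, here is a minimal sketch that scores each sentence by the document-wide frequency of its terms and returns the top-scoring sentence if it reaches a minimum weight. The actual weighting formula, caps, and thresholds used by this class are not documented here, so everything in the sketch (the class name SentenceWeightSketch, the term filter, the scoring rule) is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in, not part of the Norconex API.
final class SentenceWeightSketch {

    /** Returns the highest-weighted sentence if its weight reaches minWeight. */
    static Optional<String> bestSentence(String text, int minWeight) {
        // Count how often each term occurs in the whole text.
        Map<String, Integer> frequency = new HashMap<>();
        for (String term : terms(text)) {
            frequency.merge(term, 1, Integer::sum);
        }
        String best = null;
        int bestWeight = -1;
        // Assumption: sentences end with '.', '!' or '?' followed by whitespace.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            int weight = 0;
            // Each distinct term contributes its document-wide frequency.
            for (String term : new HashSet<>(terms(sentence))) {
                weight += frequency.get(term);
            }
            if (weight > bestWeight) { // ties keep the earliest sentence
                bestWeight = weight;
                best = sentence.trim();
            }
        }
        return bestWeight >= minWeight ? Optional.ofNullable(best) : Optional.empty();
    }

    /** Lowercased words longer than three characters (a crude stop-word filter). */
    static List<String> terms(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.length() > 3) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Solar panels convert sunlight. "
                + "Solar panels need sunlight and maintenance. Cats sleep.";
        System.out.println(bestSentence(text, 3).orElse("(no title)"));
    }
}
```

A scheme like this favors longer sentences that repeat the document's common terms, which is one reason a dedicated summarization engine tends to produce better titles.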
Since 2.11.0, this class analyzes only up to the first 10,000 characters (down from 10 million). You can change this maximum with AbstractStringTagger.setMaxReadSize(int). Because this class is not optimized for large content analysis, setting a very large number of characters could cause serious performance issues on some large files.
<tagger class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger"
    fromField="(field of text to use/default uses document content)"
    toField="(target field where to store generated title)"
    overwrite="[false|true]"
    titleMaxLength="(max num of chars for generated title)"
    detectHeading="[false|true]"
    detectHeadingMinLength="(min length a heading title can have)"
    detectHeadingMaxLength="(max length a heading title can have)"
    sourceCharset="(character encoding)"
    maxReadSize="(max characters to read at once)" >
  <restrictTo caseSensitive="[false|true]"
      field="(name of header/metadata field name to match)">
    (regular expression of value to match)
  </restrictTo>
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
</tagger>
The following will check if the first line looks like a title and if not, it will store the first sentence, up to 200 characters, in a field called title.
<tagger class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger" toField="title" titleMaxLength="200" detectHeading="true" />
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_HEADING_MAX_LENGTH |
static int | DEFAULT_HEADING_MIN_LENGTH |
static int | DEFAULT_MAX_READ_SIZE |
static int | DEFAULT_TITLE_MAX_LENGTH |
static String | DEFAULT_TO_FIELD |
static int | UNLIMITED_TITLE_LENGTH |
Constructor and Description |
---|
TitleGeneratorTagger() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
int | getDetectHeadingMaxLength() |
int | getDetectHeadingMinLength() |
int | getFallbackMaxLength() Deprecated. Since 2.2.0, use getTitleMaxLength() |
String | getFromField() |
int | getTitleMaxLength() |
String | getToField() |
int | hashCode() |
boolean | isDetectHeading() |
boolean | isOverwrite() |
protected void | loadStringTaggerFromXML(org.apache.commons.configuration.XMLConfiguration xml) Loads configuration settings specific to the implementing class. |
protected void | saveStringTaggerToXML(EnhancedXMLStreamWriter writer) Saves configuration settings specific to the implementing class. |
void | setDetectHeading(boolean detectHeading) |
void | setDetectHeadingMaxLength(int detectHeadingMaxLength) |
void | setDetectHeadingMinLength(int detectHeadingMinLength) |
void | setFallbackMaxLength(int fallbackMaxLength) Deprecated. Since 2.2.0, use setTitleMaxLength(int) |
void | setFromField(String fromField) |
void | setOverwrite(boolean overwrite) |
void | setTitleMaxLength(int titleMaxLength) |
void | setToField(String toField) |
protected void | tagStringContent(String reference, StringBuilder content, ImporterMetadata metadata, boolean parsed, int sectionIndex) |
String | toString() |
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
tagDocument
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
loadFromXML, saveToXML
public static final String DEFAULT_TO_FIELD
public static final int DEFAULT_TITLE_MAX_LENGTH
public static final int UNLIMITED_TITLE_LENGTH
public static final int DEFAULT_HEADING_MIN_LENGTH
public static final int DEFAULT_HEADING_MAX_LENGTH
public static final int DEFAULT_MAX_READ_SIZE
protected void tagStringContent(String reference, StringBuilder content, ImporterMetadata metadata, boolean parsed, int sectionIndex) throws ImporterHandlerException
Specified by: tagStringContent in class AbstractStringTagger
Throws: ImporterHandlerException
public String getToField()
public void setToField(String toField)
public boolean isOverwrite()
public void setOverwrite(boolean overwrite)
public String getFromField()
public void setFromField(String fromField)
public int getTitleMaxLength()
public void setTitleMaxLength(int titleMaxLength)
@Deprecated public int getFallbackMaxLength()
Deprecated. Since 2.2.0, use getTitleMaxLength()
@Deprecated public void setFallbackMaxLength(int fallbackMaxLength)
Deprecated. Since 2.2.0, use setTitleMaxLength(int)
Parameters: fallbackMaxLength - fallback max length
public boolean isDetectHeading()
public void setDetectHeading(boolean detectHeading)
public int getDetectHeadingMinLength()
public void setDetectHeadingMinLength(int detectHeadingMinLength)
public int getDetectHeadingMaxLength()
public void setDetectHeadingMaxLength(int detectHeadingMaxLength)
protected void loadStringTaggerFromXML(org.apache.commons.configuration.XMLConfiguration xml) throws IOException
Description copied from class: AbstractStringTagger
Loads configuration settings specific to the implementing class.
Specified by: loadStringTaggerFromXML in class AbstractStringTagger
Parameters: xml - xml configuration
Throws: IOException - could not load from XML
protected void saveStringTaggerToXML(EnhancedXMLStreamWriter writer) throws XMLStreamException
Description copied from class: AbstractStringTagger
Saves configuration settings specific to the implementing class.
Specified by: saveStringTaggerToXML in class AbstractStringTagger
Parameters: writer - the xml writer
Throws: XMLStreamException - could not save to XML
public boolean equals(Object other)
Overrides: equals in class AbstractStringTagger
public int hashCode()
Overrides: hashCode in class AbstractStringTagger
public String toString()
Overrides: toString in class AbstractStringTagger
Copyright © 2009–2021 Norconex Inc. All rights reserved.