public class TitleGeneratorTagger extends AbstractStringTagger implements IXMLConfigurable
Attempts to generate a title from the document content (default) or a specified metadata field. It does not consider the document format/structure, nor does it weight some terms more than others. For instance, it would not give text found in <H1> tags more importance than other text in HTML documents.
If isDetectHeading() returns true, this handler will check if the content starts with a stand-alone, single-sentence line (which is assumed to be the actual title). That is, a line of text with only one sentence in it, followed by one or more new line characters. To help eliminate cases where such sentences are inappropriate, you can specify a minimum and maximum number of characters that first line should have with setDetectHeadingMinLength(int) and setDetectHeadingMaxLength(int) (e.g. to ignore "Page 1" text and the like).
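The heading check described above can be approximated as follows. This is a minimal sketch, not the tagger's actual implementation: the class name `HeadingDetectionSketch`, the method `detectHeading`, and the punctuation-based single-sentence test are all assumptions made for illustration.

```java
import java.util.Optional;

// Hypothetical helper, NOT part of the Norconex API.
final class HeadingDetectionSketch {

    /**
     * Returns the first line if it is followed by a line break, holds a
     * single sentence, and its length is within [minLength, maxLength].
     */
    static Optional<String> detectHeading(
            String content, int minLength, int maxLength) {
        if (content == null) {
            return Optional.empty();
        }
        int lineEnd = content.indexOf('\n');
        if (lineEnd < 0) {
            // No line break: the first line does not stand alone.
            return Optional.empty();
        }
        String firstLine = content.substring(0, lineEnd).trim();
        if (firstLine.length() < minLength
                || firstLine.length() > maxLength) {
            return Optional.empty();
        }
        // Assumed heuristic: sentence-ending punctuation followed by more
        // text means more than one sentence. Abbreviations such as "Dr. X"
        // would be misread by this simple test.
        if (firstLine.matches(".*[.!?]\\s+\\S.*")) {
            return Optional.empty();
        }
        return Optional.of(firstLine);
    }
}
```

With a minimum length of 10, a first line such as "Page 1" is rejected, which is the kind of case the minimum and maximum settings are meant to filter out.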
Unless a target field name is provided, the default field name where the title will be stored is document.generatedTitle.
If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
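To make the four onSet options listed in the XML configuration (append, prepend, replace, optional) concrete, here is a self-contained sketch of how such strategies could act on a multi-valued field. The `OnSetSketch` class and its `Strategy` enum are hypothetical stand-ins, and the semantics shown, in particular that "optional" only sets a value when none exists, are an assumption based on the option names rather than something this page states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of the onSet strategies, NOT the real PropertySetter.
final class OnSetSketch {

    enum Strategy { APPEND, PREPEND, REPLACE, OPTIONAL }

    /** Stores a value in a multi-valued field using the given strategy. */
    static void apply(Map<String, List<String>> fields,
            String field, String value, Strategy strategy) {
        List<String> values =
                fields.computeIfAbsent(field, k -> new ArrayList<>());
        switch (strategy) {
            case APPEND:   // default behavior described above
                values.add(value);
                break;
            case PREPEND:  // new value goes first
                values.add(0, value);
                break;
            case REPLACE:  // existing values are discarded
                values.clear();
                values.add(value);
                break;
            case OPTIONAL: // assumed: only set when no value exists yet
                if (values.isEmpty()) {
                    values.add(value);
                }
                break;
        }
    }
}
```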
If it cannot generate a title, it will fall back to retrieving the first sentence from the text.
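One way to picture the first-sentence fallback is with the JDK's `java.text.BreakIterator`. This is only a sketch: the page does not say how the tagger finds sentence boundaries, and the `FirstSentenceSketch` class and the choice of `Locale.ENGLISH` are assumptions.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, NOT part of the Norconex API.
final class FirstSentenceSketch {

    /** Returns the first sentence of the text, trimmed. */
    static String firstSentence(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        int end = it.next();
        if (end == BreakIterator.DONE) {
            // Empty text: there is no boundary after the start.
            return text.trim();
        }
        return text.substring(start, end).trim();
    }
}
```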
The generated title length is limited to 150 characters by default. You can change that limit by using setTitleMaxLength(int). Text longer than the limit will be truncated and three dots will be added in square brackets ([...]). To remove the limit, use -1 (or the constant UNLIMITED_TITLE_LENGTH).
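The truncation rule can be illustrated with a short sketch. It is written under assumptions: the `TitleTruncationSketch` class is hypothetical, the marker is appended after the cut rather than counted toward the limit (the page does not specify which), and any negative limit is treated as unlimited.

```java
// Hypothetical helper, NOT part of the Norconex API.
final class TitleTruncationSketch {

    /** Mirrors the documented -1 value meaning "no limit". */
    static final int UNLIMITED = -1;

    /** Cuts the title at maxLength and appends "[...]" when it is too long. */
    static String truncate(String title, int maxLength) {
        if (maxLength < 0 || title.length() <= maxLength) {
            // Unlimited, or already short enough: return unchanged.
            return title;
        }
        // Assumption: the marker is added on top of the kept characters.
        return title.substring(0, maxLength) + "[...]";
    }
}
```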
This class should be used as a post-parsing handler only (or otherwise on unformatted text).
The algorithm to detect titles is quite basic. It uses a generic statistics-based approach to weight each sentence up to a certain amount, and simply returns the sentence with the highest attributed weight, provided a minimum threshold has been met. You are strongly encouraged to use a more sophisticated summarization engine if you want more accurate titles generated.
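As an illustration of what a statistics-based weighting can look like, the sketch below scores each sentence by the average frequency of its words across the whole text and returns the best one if it reaches a threshold. This is a generic stand-in technique, not the algorithm this class uses; the `SentenceWeightSketch` class, the word-length filter, and the scoring formula are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical term-frequency scorer, NOT the tagger's real algorithm.
final class SentenceWeightSketch {

    /** Returns the highest-weighted sentence if it reaches minWeight. */
    static Optional<String> bestSentence(String text, double minWeight) {
        // Count how often each significant word occurs in the whole text.
        Map<String, Integer> freq = new HashMap<>();
        for (String word : words(text)) {
            freq.merge(word, 1, Integer::sum);
        }
        String best = null;
        double bestWeight = 0;
        // Naive sentence split: whitespace following ., ! or ?
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            List<String> sentenceWords = words(sentence);
            if (sentenceWords.isEmpty()) {
                continue;
            }
            double total = 0;
            for (String word : sentenceWords) {
                total += freq.get(word);
            }
            // Weight = average document frequency of the sentence's words.
            double weight = total / sentenceWords.size();
            if (weight > bestWeight) {
                bestWeight = weight;
                best = sentence.trim();
            }
        }
        return best != null && bestWeight >= minWeight
                ? Optional.of(best) : Optional.empty();
    }

    /** Lower-cased words longer than 3 characters (crude stop-word filter). */
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        String lower = text.toLowerCase(Locale.ROOT);
        for (String token : lower.split("[^\\p{L}\\p{N}]+")) {
            if (token.length() > 3) {
                out.add(token);
            }
        }
        return out;
    }
}
```

When no sentence reaches the threshold, the method returns an empty result, which corresponds to the case where the tagger falls back to the first sentence.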
This tagger will only analyze up to the first 10,000 characters. You can change this maximum with AbstractStringTagger.setMaxReadSize(int). Given this class is not optimized for large content analysis, setting a huge maximum number of characters could cause serious performance issues on large files.
<handler
    class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger"
    maxReadSize="(max characters to read at once)"
    sourceCharset="(character encoding)"
    fromField="(field of text to use/default uses document content)"
    toField="(target field where to store generated title)"
    onSet="[append|prepend|replace|optional]"
    titleMaxLength="(max num of chars for generated title)"
    detectHeading="[false|true]"
    detectHeadingMinLength="(min length a heading title can have)"
    detectHeadingMaxLength="(max length a heading title can have)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
</handler>
<handler
    class="TitleGeneratorTagger"
    toField="title"
    titleMaxLength="200"
    detectHeading="true"/>
The above will check if the first line looks like a title and if not, it will store the first sentence, up to 200 characters, in a field called title.
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_HEADING_MAX_LENGTH |
static int | DEFAULT_HEADING_MIN_LENGTH |
static int | DEFAULT_MAX_READ_SIZE |
static int | DEFAULT_TITLE_MAX_LENGTH |
static String | DEFAULT_TO_FIELD |
static int | UNLIMITED_TITLE_LENGTH |
Constructor and Description |
---|
TitleGeneratorTagger() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
int | getDetectHeadingMaxLength() |
int | getDetectHeadingMinLength() |
String | getFromField() |
PropertySetter | getOnSet() Gets the property setter to use when a value is set. |
int | getTitleMaxLength() |
String | getToField() |
int | hashCode() |
boolean | isDetectHeading() |
boolean | isOverwrite() Deprecated. Since 3.0.0 use getOnSet(). |
protected void | loadStringTaggerFromXML(XML xml) Loads configuration settings specific to the implementing class. |
protected void | saveStringTaggerToXML(XML xml) Saves configuration settings specific to the implementing class. |
void | setDetectHeading(boolean detectHeading) |
void | setDetectHeadingMaxLength(int detectHeadingMaxLength) |
void | setDetectHeadingMinLength(int detectHeadingMinLength) |
void | setFromField(String fromField) |
void | setOnSet(PropertySetter onSet) Sets the property setter to use when a value is set. |
void | setOverwrite(boolean overwrite) Deprecated. Since 3.0.0 use setOnSet(PropertySetter). |
void | setTitleMaxLength(int titleMaxLength) |
void | setToField(String toField) |
protected void | tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) |
String | toString() |
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
tagDocument
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
loadFromXML, saveToXML
public static final String DEFAULT_TO_FIELD
public static final int DEFAULT_TITLE_MAX_LENGTH
public static final int UNLIMITED_TITLE_LENGTH
public static final int DEFAULT_HEADING_MIN_LENGTH
public static final int DEFAULT_HEADING_MAX_LENGTH
public static final int DEFAULT_MAX_READ_SIZE
protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
Specified by: tagStringContent in class AbstractStringTagger
Throws: ImporterHandlerException
public String getToField()
public void setToField(String toField)
@Deprecated public boolean isOverwrite()
Deprecated. Since 3.0.0 use getOnSet().
Returns: true if overwriting existing value.
@Deprecated public void setOverwrite(boolean overwrite)
Deprecated. Since 3.0.0 use setOnSet(PropertySetter).
Parameters: overwrite - true if overwriting existing value.
public String getFromField()
public void setFromField(String fromField)
public int getTitleMaxLength()
public void setTitleMaxLength(int titleMaxLength)
public boolean isDetectHeading()
public void setDetectHeading(boolean detectHeading)
public int getDetectHeadingMinLength()
public void setDetectHeadingMinLength(int detectHeadingMinLength)
public int getDetectHeadingMaxLength()
public void setDetectHeadingMaxLength(int detectHeadingMaxLength)
public PropertySetter getOnSet()
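public void setOnSet(PropertySetter onSet)
Parameters: onSet - property setter
protected void loadStringTaggerFromXML(XML xml)
Description copied from class: AbstractStringTagger
Loads configuration settings specific to the implementing class.
Specified by: loadStringTaggerFromXML in class AbstractStringTagger
Parameters: xml - xml configuration
protected void saveStringTaggerToXML(XML xml)
Description copied from class: AbstractStringTagger
Saves configuration settings specific to the implementing class.
Specified by: saveStringTaggerToXML in class AbstractStringTagger
Parameters: xml - the XML
public boolean equals(Object other)
Overrides: equals in class AbstractStringTagger
public int hashCode()
Overrides: hashCode in class AbstractStringTagger
public String toString()
Overrides: toString in class AbstractStringTagger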
Copyright © 2009–2023 Norconex Inc. All rights reserved.