Class TitleGeneratorTagger
- java.lang.Object
-
- com.norconex.importer.handler.AbstractImporterHandler
-
- com.norconex.importer.handler.tagger.AbstractDocumentTagger
-
- com.norconex.importer.handler.tagger.AbstractCharStreamTagger
-
- com.norconex.importer.handler.tagger.AbstractStringTagger
-
- com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger
-
- All Implemented Interfaces:
IXMLConfigurable, IImporterHandler, IDocumentTagger
public class TitleGeneratorTagger extends AbstractStringTagger implements IXMLConfigurable
Attempts to generate a title from the document content (default) or a specified metadata field. It considers neither the document format nor its structure, and it does not weight some terms more than others. For instance, in HTML documents it would not give text found in <H1> tags more importance than other text.
If isDetectHeading() returns true, this handler will check if the content starts with a stand-alone, single-sentence line (which is assumed to be the actual title). That is, a line of text with only one sentence in it, followed by one or more new line characters. To help eliminate cases where such a sentence is inappropriate, you can specify a minimum and maximum number of characters that first line should have with setDetectHeadingMinLength(int) and setDetectHeadingMaxLength(int) (e.g. to ignore "Page 1" text and the like).
Unless a target field name is provided, the default field name where the title will be stored is document.generatedTitle.
Storing values in an existing field
If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
If it cannot generate a title, it will fall back to retrieving the first sentence from the text.
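To illustrate the difference between the default behavior (adding to the end of the existing value list) and a replacing behavior, here is a minimal, self-contained sketch. It does not use PropertySetter itself: the FieldStoreSketch class, its Mode enum and the plain map standing in for document metadata are all hypothetical names introduced for this illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: a plain map stands in for multi-valued metadata. */
final class FieldStoreSketch {

    /** Hypothetical write modes; this is not the PropertySetter API. */
    enum Mode { APPEND, REPLACE }

    /** Stores a value in a field, appending to or replacing existing values. */
    static void store(Map<String, List<String>> metadata,
            String field, String value, Mode mode) {
        List<String> values =
                metadata.computeIfAbsent(field, k -> new ArrayList<>());
        if (mode == Mode.REPLACE) {
            values.clear(); // drop whatever the field held before
        }
        values.add(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        store(metadata, "title", "Original title", Mode.APPEND);
        store(metadata, "title", "Generated title", Mode.APPEND);
        // prints [Original title, Generated title]
        System.out.println(metadata.get("title"));
        store(metadata, "title", "Generated title", Mode.REPLACE);
        // prints [Generated title]
        System.out.println(metadata.get("title"));
    }
}
```

With the appending mode, a document that already carries a title ends up with two values in the target field, which is worth keeping in mind when toField points at an existing field.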
The generated title length is limited to 150 characters by default. You can change that limit by using setTitleMaxLength(int). Text longer than the limit will be truncated and three dots in square brackets ([...]) will be appended. To remove the limit, use -1 (or the constant UNLIMITED_TITLE_LENGTH).
This class should be used as a post-parsing handler only (or otherwise on unformatted text).
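The title length limit described above can be pictured with the following sketch. It is an approximation under stated assumptions, not the class's actual code: it assumes a hard cut at exactly the maximum number of characters and that the "[...]" marker is appended after the cut (so the result can exceed the limit by the length of the marker). The class name TitleTruncationSketch is hypothetical.

```java
/** Illustration only: one plausible reading of the title length limit. */
final class TitleTruncationSketch {

    /** Same value the documentation gives for removing the limit. */
    static final int UNLIMITED_TITLE_LENGTH = -1;

    /** Marker appended to truncated titles. */
    static final String SUFFIX = "[...]";

    /**
     * Returns the title unchanged when no limit applies or it already fits.
     * Otherwise cuts it at titleMaxLength characters (assumption: hard cut,
     * no word-boundary handling) and appends the marker.
     */
    static String truncate(String title, int titleMaxLength) {
        if (titleMaxLength < 0 || title.length() <= titleMaxLength) {
            return title;
        }
        return title.substring(0, titleMaxLength) + SUFFIX;
    }

    public static void main(String[] args) {
        // prints Hello[...]
        System.out.println(truncate("Hello world", 5));
        // prints Hello world
        System.out.println(truncate("Hello world", 150));
        // prints Hello world
        System.out.println(truncate("Hello world", UNLIMITED_TITLE_LENGTH));
    }
}
```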
The algorithm to detect titles is quite basic. It uses a generic statistics-based approach to weight each sentence up to a certain amount, and simply returns the sentence with the highest attributed weight, provided a minimum threshold has been met. You are strongly encouraged to use a more sophisticated summarization engine if you want more accurate titles generated.
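The exact weighting used by this class is not documented here. As a rough picture of what "weight each sentence and keep the best one above a threshold" can look like, the sketch below scores each sentence by the average frequency of its words across the whole text. This scoring scheme is a stand-in chosen for illustration, not the algorithm of this class; the class name SentenceWeightSketch, the minWeight parameter and the choice of an English locale are likewise assumptions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Illustration only: average term frequency as a sentence weight. */
final class SentenceWeightSketch {

    /** Splits text into trimmed, non-empty sentences. */
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    /** Lower-cases the text and splits it on anything but letters/digits. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        String lower = text.toLowerCase(Locale.ENGLISH);
        for (String word : lower.split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    /** Returns the highest-weighted sentence if it reaches minWeight. */
    static Optional<String> bestSentence(String text, double minWeight) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String word : words(text)) {
            frequencies.merge(word, 1, Integer::sum);
        }
        String best = null;
        double bestWeight = -1;
        for (String sentence : sentences(text)) {
            List<String> sentenceWords = words(sentence);
            if (sentenceWords.isEmpty()) {
                continue;
            }
            double sum = 0;
            for (String word : sentenceWords) {
                sum += frequencies.getOrDefault(word, 0);
            }
            double weight = sum / sentenceWords.size();
            if (weight > bestWeight) {
                bestWeight = weight;
                best = sentence;
            }
        }
        return bestWeight >= minWeight
                ? Optional.ofNullable(best) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = "Cats sleep. Cats purr and cats play. Dogs bark.";
        System.out.println(bestSentence(text, 1.5));
        System.out.println(bestSentence(text, 5.0));
    }
}
```

Averaging favors short sentences made of frequent words, which shows why such generic statistics can produce unconvincing titles and why a dedicated summarization engine is the better choice when accuracy matters.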
Max read size
This tagger will only analyze up to the first 10,000 characters. You can change this maximum with AbstractStringTagger.setMaxReadSize(int). Given this class is not optimized for large content analysis, setting a huge maximum number of characters could cause serious performance issues on large files.
XML configuration usage:
<handler class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger"
    maxReadSize="(max characters to read at once)"
    sourceCharset="(character encoding)"
    fromField="(field of text to use/default uses document content)"
    toField="(target field where to store generated title)"
    titleMaxLength="(max num of chars for generated title)"
    detectHeading="[false|true]"
    detectHeadingMinLength="(min length a heading title can have)"
    detectHeadingMaxLength="(max length a heading title can have)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher>(field-matching expression)</fieldMatcher>
    <valueMatcher>(value-matching expression)</valueMatcher>
  </restrictTo>
</handler>
XML usage example:
<handler class="TitleGeneratorTagger" toField="title" titleMaxLength="200" detectHeading="true"/>
The above will check if the first line looks like a title and if not, it will store the first sentence, up to 200 characters, in a field called title.
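The flow described for this example (try the first line as a heading, otherwise take the first sentence) can be sketched as follows. This is a simplified illustration, not the class's implementation: the class name HeadingDetectionSketch, the use of java.text.BreakIterator to count sentences, and the minimum/maximum values passed in main are all assumptions, and the title length limit is left out.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Illustration only: heading check with a first-sentence fallback. */
final class HeadingDetectionSketch {

    /** Returns the first sentence of the text, trimmed. */
    static String firstSentence(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        int end = it.next();
        if (end == BreakIterator.DONE) {
            return text.trim();
        }
        return text.substring(start, end).trim();
    }

    /**
     * Returns the first line if it is followed by a new line, holds exactly
     * one sentence and has an acceptable length; otherwise returns null.
     */
    static String headingOrNull(String content, int minLength, int maxLength) {
        String text = content.replaceFirst("^\\s+", "");
        int lineEnd = text.indexOf('\n');
        if (lineEnd < 0) {
            return null; // no new line after the first line
        }
        String line = text.substring(0, lineEnd).trim();
        if (line.length() < minLength || line.length() > maxLength) {
            return null; // e.g. "Page 1" is too short
        }
        // A single-sentence line is its own first sentence.
        return firstSentence(line).equals(line) ? line : null;
    }

    /** Heading if one is detected, else the first sentence. */
    static String generateTitle(String content, int minLength, int maxLength) {
        String heading = headingOrNull(content, minLength, maxLength);
        return heading != null ? heading : firstSentence(content);
    }

    public static void main(String[] args) {
        // Arbitrary bounds chosen for this example only.
        int min = 10;
        int max = 80;
        System.out.println(generateTitle(
                "Annual Report 2020\n\nThis is the body. More text.", min, max));
        System.out.println(generateTitle(
                "Hello there. This is it.\nBody text follows.", min, max));
    }
}
```

The first call finds a stand-alone, single-sentence first line and uses it as is; the second has two sentences on its first line, so the sketch falls back to the first sentence.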
- Since:
- 2.1.0
- Author:
- Pascal Essiembre
-
-
Field Summary
Fields (Modifier and Type, Field):
static int DEFAULT_HEADING_MAX_LENGTH
static int DEFAULT_HEADING_MIN_LENGTH
static int DEFAULT_MAX_READ_SIZE
static int DEFAULT_TITLE_MAX_LENGTH
static String DEFAULT_TO_FIELD
static int UNLIMITED_TITLE_LENGTH
-
Constructor Summary
Constructors:
TitleGeneratorTagger()
-
Method Summary
Methods (Modifier and Type, Method, Description):
boolean equals(Object other)
int getDetectHeadingMaxLength()
int getDetectHeadingMinLength()
String getFromField()
PropertySetter getOnSet(): Gets the property setter to use when a value is set.
int getTitleMaxLength()
String getToField()
int hashCode()
boolean isDetectHeading()
boolean isOverwrite(): Deprecated. Since 3.0.0 use getOnSet().
protected void loadStringTaggerFromXML(XML xml): Loads configuration settings specific to the implementing class.
protected void saveStringTaggerToXML(XML xml): Saves configuration settings specific to the implementing class.
void setDetectHeading(boolean detectHeading)
void setDetectHeadingMaxLength(int detectHeadingMaxLength)
void setDetectHeadingMinLength(int detectHeadingMinLength)
void setFromField(String fromField)
void setOnSet(PropertySetter onSet): Sets the property setter to use when a value is set.
void setOverwrite(boolean overwrite): Deprecated. Since 3.0.0 use setOnSet(PropertySetter).
void setTitleMaxLength(int titleMaxLength)
void setToField(String toField)
protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex)
String toString()
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractStringTagger
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractCharStreamTagger
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument
-
Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
-
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML
-
-
-
-
Field Detail
-
DEFAULT_TO_FIELD
public static final String DEFAULT_TO_FIELD
- See Also:
- Constant Field Values
-
DEFAULT_TITLE_MAX_LENGTH
public static final int DEFAULT_TITLE_MAX_LENGTH
- See Also:
- Constant Field Values
-
UNLIMITED_TITLE_LENGTH
public static final int UNLIMITED_TITLE_LENGTH
- See Also:
- Constant Field Values
-
DEFAULT_HEADING_MIN_LENGTH
public static final int DEFAULT_HEADING_MIN_LENGTH
- See Also:
- Constant Field Values
-
DEFAULT_HEADING_MAX_LENGTH
public static final int DEFAULT_HEADING_MAX_LENGTH
- See Also:
- Constant Field Values
-
DEFAULT_MAX_READ_SIZE
public static final int DEFAULT_MAX_READ_SIZE
- See Also:
- Constant Field Values
-
-
Method Detail
-
tagStringContent
protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
- Specified by:
tagStringContent
in class AbstractStringTagger
- Throws:
ImporterHandlerException
-
getToField
public String getToField()
-
setToField
public void setToField(String toField)
-
isOverwrite
@Deprecated public boolean isOverwrite()
Deprecated. Since 3.0.0 use getOnSet().
Gets whether existing value for the same field should be overwritten.
- Returns:
true if overwriting existing value.
-
setOverwrite
@Deprecated public void setOverwrite(boolean overwrite)
Deprecated. Since 3.0.0 use setOnSet(PropertySetter).
Sets whether existing value for the same field should be overwritten.
- Parameters:
overwrite - true if overwriting existing value.
-
getFromField
public String getFromField()
-
setFromField
public void setFromField(String fromField)
-
getTitleMaxLength
public int getTitleMaxLength()
-
setTitleMaxLength
public void setTitleMaxLength(int titleMaxLength)
-
isDetectHeading
public boolean isDetectHeading()
-
setDetectHeading
public void setDetectHeading(boolean detectHeading)
-
getDetectHeadingMinLength
public int getDetectHeadingMinLength()
-
setDetectHeadingMinLength
public void setDetectHeadingMinLength(int detectHeadingMinLength)
-
getDetectHeadingMaxLength
public int getDetectHeadingMaxLength()
-
setDetectHeadingMaxLength
public void setDetectHeadingMaxLength(int detectHeadingMaxLength)
-
getOnSet
public PropertySetter getOnSet()
Gets the property setter to use when a value is set.
- Returns:
- property setter
- Since:
- 3.0.0
-
setOnSet
public void setOnSet(PropertySetter onSet)
Sets the property setter to use when a value is set.
- Parameters:
onSet - property setter
- Since:
- 3.0.0
-
loadStringTaggerFromXML
protected void loadStringTaggerFromXML(XML xml)
Description copied from class: AbstractStringTagger
Loads configuration settings specific to the implementing class.
- Specified by:
loadStringTaggerFromXML in class AbstractStringTagger
- Parameters:
xml
- xml configuration
-
saveStringTaggerToXML
protected void saveStringTaggerToXML(XML xml)
Description copied from class: AbstractStringTagger
Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
- Specified by:
saveStringTaggerToXML in class AbstractStringTagger
- Parameters:
xml
- the XML
-
equals
public boolean equals(Object other)
- Overrides:
equals
in class AbstractStringTagger
-
hashCode
public int hashCode()
- Overrides:
hashCode
in class AbstractStringTagger
-
toString
public String toString()
- Overrides:
toString
in class AbstractStringTagger
-
-