Class TitleGeneratorTagger
java.lang.Object
  com.norconex.importer.handler.AbstractImporterHandler
    com.norconex.importer.handler.tagger.AbstractDocumentTagger
      com.norconex.importer.handler.tagger.AbstractCharStreamTagger
        com.norconex.importer.handler.tagger.AbstractStringTagger
          com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger
All Implemented Interfaces:
IXMLConfigurable, IImporterHandler, IDocumentTagger
public class TitleGeneratorTagger extends AbstractStringTagger implements IXMLConfigurable
Attempts to generate a title from the document content (default) or a specified metadata field. It does not consider the document format or structure, nor does it weight some terms more than others. For instance, in HTML documents it would not treat text found in <H1> tags as more important than other text.
If isDetectHeading() returns true, this handler will check whether the content starts with a stand-alone, single-sentence line (which is assumed to be the actual title). That is, a line of text with only one sentence in it, followed by one or more new line characters. To help eliminate cases where such a sentence is inappropriate, you can specify the minimum and maximum number of characters that first line should have with setDetectHeadingMinLength(int) and setDetectHeadingMaxLength(int) (e.g., to ignore "Page 1" text and the like).
Unless a target field name is provided, the default field name where the title will be stored is document.generatedTitle.
Storing values in an existing field
If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
If it cannot generate a title, it will fall back to retrieving the first sentence from the text.
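The heading check and the first-sentence fallback described above can be pictured with a minimal, self-contained sketch. It is not the Norconex implementation: the class name HeadingDetectionSketch, its method names, and the use of java.text.BreakIterator for sentence splitting are all assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Illustrative sketch only; the real tagger's internals may differ. */
final class HeadingDetectionSketch {

    /**
     * Returns the first line when it is followed by a line break, holds
     * exactly one sentence, and has a length within [minLength, maxLength].
     * Returns null otherwise.
     */
    static String detectHeading(String text, int minLength, int maxLength) {
        // Assumption: leading whitespace is ignored before looking at the first line.
        String t = text.replaceFirst("^\\s+", "");
        int lineEnd = indexOfLineBreak(t);
        if (lineEnd < 0) {
            return null; // no line break, so the line is not stand-alone
        }
        String firstLine = t.substring(0, lineEnd).trim();
        if (firstLine.length() < minLength || firstLine.length() > maxLength) {
            return null; // e.g. rejects short lines such as "Page 1"
        }
        return countSentences(firstLine) == 1 ? firstLine : null;
    }

    /** Fallback: returns the first sentence of the text. */
    static String firstSentence(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int end = it.next();
        return end == BreakIterator.DONE ? text.trim() : text.substring(0, end).trim();
    }

    /** Index of the first CR or LF character, or -1 if there is none. */
    private static int indexOfLineBreak(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r') {
                return i;
            }
        }
        return -1;
    }

    /** Counts the non-blank sentences found by the sentence BreakIterator. */
    private static int countSentences(String line) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(line);
        int count = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (!line.substring(start, end).trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }
}
```

A caller would try detectHeading(text, min, max) first and use firstSentence(text) only when it returns null. The minimum and maximum passed in are the caller's choice; this sketch does not assume the tagger's default values.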
The generated title length is limited to 150 characters by default. You can change that limit by using setTitleMaxLength(int). Text longer than the maximum will be truncated and three dots in square brackets ([...]) will be added. To remove the limit, use -1 (or the constant UNLIMITED_TITLE_LENGTH).
This class should be used as a post-parsing handler only (or otherwise on unformatted text).
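The title length limit described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the class and method names are hypothetical, and the documentation does not say whether the "[...]" marker counts toward the limit, so this sketch assumes it is appended after the text is cut to the maximum length.

```java
/** Illustrative sketch of the documented truncation rule; names are hypothetical. */
final class TitleTruncationSketch {

    /** Mirrors the documented convention that -1 removes the limit. */
    static final int UNLIMITED = -1;

    /**
     * Cuts the title to maxLength characters and appends "[...]" when it is
     * longer than the limit. A negative maxLength disables truncation.
     */
    static String truncate(String title, int maxLength) {
        if (maxLength < 0 || title.length() <= maxLength) {
            return title; // unlimited, or already short enough
        }
        // Assumption: the marker is added on top of the maxLength characters.
        return title.substring(0, maxLength) + "[...]";
    }
}
```

Under that assumption, truncate("Hello, world", 5) gives "Hello[...]", while passing UNLIMITED returns the title unchanged.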
The algorithm to detect titles is quite basic. It uses a generic statistics-based approach to weight each sentence up to a certain amount, and simply returns the sentence with the highest attributed weight, provided a minimum threshold has been met. You are strongly encouraged to use a more sophisticated summarization engine if you want more accurate titles.
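The documentation does not specify the statistics used, so the following is only one generic way a "highest-weighted sentence above a threshold" selection could look. It uses average term frequency as a stand-in weight, which is an assumption and not Norconex's algorithm; the class name, method names, and threshold semantics are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative term-frequency sentence scoring; not the tagger's actual algorithm. */
final class SentenceWeightSketch {

    /**
     * Weights each sentence by the average document-wide frequency of its
     * words and returns the highest-weighted sentence, or null when no
     * sentence reaches minWeight. Ties go to the earliest sentence.
     */
    static String bestSentence(String text, double minWeight) {
        List<String> sentences = splitSentences(text);

        // Count how often each word occurs in the whole text.
        Map<String, Integer> freq = new HashMap<>();
        for (String sentence : sentences) {
            for (String word : words(sentence)) {
                freq.merge(word, 1, Integer::sum);
            }
        }

        String best = null;
        double bestWeight = Double.NEGATIVE_INFINITY;
        for (String sentence : sentences) {
            List<String> ws = words(sentence);
            if (ws.isEmpty()) {
                continue;
            }
            double sum = 0;
            for (String word : ws) {
                sum += freq.get(word);
            }
            double weight = sum / ws.size();
            if (weight > bestWeight) {
                bestWeight = weight;
                best = sentence;
            }
        }
        return bestWeight >= minWeight ? best : null;
    }

    /** Splits text into trimmed, non-empty sentences. */
    private static List<String> splitSentences(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    /** Lower-cases the sentence and splits it on anything that is not a letter or digit. */
    private static List<String> words(String sentence) {
        List<String> result = new ArrayList<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }
}
```

Averaging by word count keeps long sentences from winning on length alone, but it also favors short sentences built from frequent words, which is one reason a dedicated summarization engine tends to give better results.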
Max read size
This tagger will only analyze up to the first 10,000 characters. You can change this maximum with AbstractStringTagger.setMaxReadSize(int). Given this class is not optimized for large content analysis, setting a huge maximum number of characters could cause serious performance issues on large files.
XML configuration usage:
<handler class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger"
    maxReadSize="(max characters to read at once)"
    sourceCharset="(character encoding)"
    fromField="(field of text to use/default uses document content)"
    toField="(target field where to store generated title)"
    titleMaxLength="(max num of chars for generated title)"
    detectHeading="[false|true]"
    detectHeadingMinLength="(min length a heading title can have)"
    detectHeadingMaxLength="(max length a heading title can have)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher>(field-matching expression)</fieldMatcher>
    <valueMatcher>(value-matching expression)</valueMatcher>
  </restrictTo>
</handler>
XML usage example:
<handler class="TitleGeneratorTagger" toField="title" titleMaxLength="200" detectHeading="true"/>
The above will check if the first line looks like a title and, if not, it will store the first sentence, up to 200 characters, in a field called title.
Since: 2.1.0
Author: Pascal Essiembre
Field Summary
Modifier and Type    Field
static int           DEFAULT_HEADING_MAX_LENGTH
static int           DEFAULT_HEADING_MIN_LENGTH
static int           DEFAULT_MAX_READ_SIZE
static int           DEFAULT_TITLE_MAX_LENGTH
static String        DEFAULT_TO_FIELD
static int           UNLIMITED_TITLE_LENGTH
Constructor Summary
Constructor
TitleGeneratorTagger()
Method Summary
Modifier and Type    Method
boolean              equals(Object other)
int                  getDetectHeadingMaxLength()
int                  getDetectHeadingMinLength()
String               getFromField()
PropertySetter       getOnSet()
                       Gets the property setter to use when a value is set.
int                  getTitleMaxLength()
String               getToField()
int                  hashCode()
boolean              isDetectHeading()
boolean              isOverwrite()
                       Deprecated. Since 3.0.0 use getOnSet().
protected void       loadStringTaggerFromXML(XML xml)
                       Loads configuration settings specific to the implementing class.
protected void       saveStringTaggerToXML(XML xml)
                       Saves configuration settings specific to the implementing class.
void                 setDetectHeading(boolean detectHeading)
void                 setDetectHeadingMaxLength(int detectHeadingMaxLength)
void                 setDetectHeadingMinLength(int detectHeadingMinLength)
void                 setFromField(String fromField)
void                 setOnSet(PropertySetter onSet)
                       Sets the property setter to use when a value is set.
void                 setOverwrite(boolean overwrite)
                       Deprecated. Since 3.0.0 use setOnSet(PropertySetter).
void                 setTitleMaxLength(int titleMaxLength)
void                 setToField(String toField)
protected void       tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex)
String               toString()

Methods inherited from class com.norconex.importer.handler.tagger.AbstractStringTagger
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument

Methods inherited from class com.norconex.importer.handler.tagger.AbstractCharStreamTagger
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument

Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument

Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML

Field Detail
DEFAULT_TO_FIELD
public static final String DEFAULT_TO_FIELD
See Also: Constant Field Values

DEFAULT_TITLE_MAX_LENGTH
public static final int DEFAULT_TITLE_MAX_LENGTH
See Also: Constant Field Values

UNLIMITED_TITLE_LENGTH
public static final int UNLIMITED_TITLE_LENGTH
See Also: Constant Field Values

DEFAULT_HEADING_MIN_LENGTH
public static final int DEFAULT_HEADING_MIN_LENGTH
See Also: Constant Field Values

DEFAULT_HEADING_MAX_LENGTH
public static final int DEFAULT_HEADING_MAX_LENGTH
See Also: Constant Field Values

DEFAULT_MAX_READ_SIZE
public static final int DEFAULT_MAX_READ_SIZE
See Also: Constant Field Values

Method Detail
tagStringContent
protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
Specified by:
tagStringContent in class AbstractStringTagger
Throws:
ImporterHandlerException

getToField
public String getToField()

setToField
public void setToField(String toField)

isOverwrite
@Deprecated public boolean isOverwrite()
Deprecated. Since 3.0.0 use getOnSet().
Gets whether existing value for the same field should be overwritten.
Returns:
true if overwriting existing value.

setOverwrite
@Deprecated public void setOverwrite(boolean overwrite)
Deprecated. Since 3.0.0 use setOnSet(PropertySetter).
Sets whether existing value for the same field should be overwritten.
Parameters:
overwrite - true if overwriting existing value.

getFromField
public String getFromField()

setFromField
public void setFromField(String fromField)

getTitleMaxLength
public int getTitleMaxLength()

setTitleMaxLength
public void setTitleMaxLength(int titleMaxLength)

isDetectHeading
public boolean isDetectHeading()

setDetectHeading
public void setDetectHeading(boolean detectHeading)

getDetectHeadingMinLength
public int getDetectHeadingMinLength()

setDetectHeadingMinLength
public void setDetectHeadingMinLength(int detectHeadingMinLength)

getDetectHeadingMaxLength
public int getDetectHeadingMaxLength()

setDetectHeadingMaxLength
public void setDetectHeadingMaxLength(int detectHeadingMaxLength)

getOnSet
public PropertySetter getOnSet()
Gets the property setter to use when a value is set.
Returns:
property setter
Since:
3.0.0

setOnSet
public void setOnSet(PropertySetter onSet)
Sets the property setter to use when a value is set.
Parameters:
onSet - property setter
Since:
3.0.0

loadStringTaggerFromXML
protected void loadStringTaggerFromXML(XML xml)
Description copied from class: AbstractStringTagger
Loads configuration settings specific to the implementing class.
Specified by:
loadStringTaggerFromXML in class AbstractStringTagger
Parameters:
xml - xml configuration

saveStringTaggerToXML
protected void saveStringTaggerToXML(XML xml)
Description copied from class: AbstractStringTagger
Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
Specified by:
saveStringTaggerToXML in class AbstractStringTagger
Parameters:
xml - the XML

equals
public boolean equals(Object other)
Overrides:
equals in class AbstractStringTagger

hashCode
public int hashCode()
Overrides:
hashCode in class AbstractStringTagger

toString
public String toString()
Overrides:
toString in class AbstractStringTagger