TitleGeneratorTagger (Norconex Importer 3.0.1 API)

java.lang.Object
- com.norconex.importer.handler.AbstractImporterHandler
  - com.norconex.importer.handler.tagger.AbstractDocumentTagger
    - com.norconex.importer.handler.tagger.AbstractCharStreamTagger
      - com.norconex.importer.handler.tagger.AbstractStringTagger
        - com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger

All Implemented Interfaces:

IXMLConfigurable, IImporterHandler, IDocumentTagger
```
public class TitleGeneratorTagger
extends AbstractStringTagger
implements IXMLConfigurable
```
Attempts to generate a title from the document content (default) or a specified metadata field. It does not consider the document format or structure, nor does it weight some terms more than others. For instance, it would not give text found in <H1> tags more importance than other text in HTML documents.

If isDetectHeading() returns true, this handler checks whether the content starts with a stand-alone, single-sentence line, which is assumed to be the actual title. That is, a line of text containing only one sentence, followed by one or more newline characters. To help eliminate cases where such a sentence is inappropriate (e.g. "Page 1" text and the like), you can specify the minimum and maximum number of characters that first line should have with setDetectHeadingMinLength(int) and setDetectHeadingMaxLength(int).
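
The heading check described above can be illustrated with a minimal, self-contained sketch. This is not the Norconex implementation: the class `HeadingSketch`, its method names, the use of `java.text.BreakIterator` to count sentences, and the inclusive treatment of the length bounds are all assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical illustration only; not the TitleGeneratorTagger source.
class HeadingSketch {

    /**
     * Returns the first line when it stands alone (is followed by a line
     * break), holds exactly one sentence, and has a length within
     * [minLength, maxLength]. Returns null otherwise.
     */
    static String detectHeading(String content, int minLength, int maxLength) {
        // Ignore leading whitespace so the "first line" is the first text line.
        String text = content.replaceFirst("^\\s+", "");
        int newline = text.indexOf('\n');
        if (newline < 0) {
            return null; // no line break: the first line does not stand alone
        }
        String firstLine = text.substring(0, newline).trim();
        if (firstLine.length() < minLength || firstLine.length() > maxLength) {
            return null; // e.g. "Page 1" is rejected by a minimum length of 10
        }
        return countSentences(firstLine) == 1 ? firstLine : null;
    }

    /** Counts sentences using the JDK sentence break iterator. */
    static int countSentences(String line) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(line);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String doc = "Annual Report 2020\n\nRevenue grew. Costs fell.";
        System.out.println(detectHeading(doc, 10, 100));
    }
}
```

In this sketch a first line that is too short, too long, or made of several sentences yields null, at which point a caller would move on to the statistical title generation.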

Unless a target field name is provided, the default field name where the title will be stored is document.generatedTitle.

Storing values in an existing field

If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
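
To make the effect of the onSet options concrete, here is a small stand-alone sketch of how the four strategies listed in the XML configuration (append, prepend, replace, optional) could act on a field's existing value list. The enum `OnSet` and the class `OnSetSketch` are hypothetical stand-ins and do not use the real PropertySetter class. In particular, the reading of "optional" as "set only when the field has no value yet" is an assumption that should be checked against the PropertySetter documentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the onSet strategies; not the PropertySetter API.
class OnSetSketch {

    enum OnSet { APPEND, PREPEND, REPLACE, OPTIONAL }

    /** Returns the field values after storing the title with the given strategy. */
    static List<String> apply(OnSet onSet, List<String> existing, String title) {
        List<String> result = new ArrayList<>(existing);
        switch (onSet) {
            case APPEND:   // default behavior described above: add at the end
                result.add(title);
                break;
            case PREPEND:  // add before the existing values
                result.add(0, title);
                break;
            case REPLACE:  // discard existing values
                result.clear();
                result.add(title);
                break;
            case OPTIONAL: // assumed: only set when the field is empty
                if (result.isEmpty()) {
                    result.add(title);
                }
                break;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> existing = Arrays.asList("Old title");
        for (OnSet onSet : OnSet.values()) {
            System.out.println(onSet + " -> " + apply(onSet, existing, "New title"));
        }
    }
}
```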

If it cannot generate a title, it will fall back to retrieving the first sentence from the text.

The generated title length is limited to 150 characters by default. You can change that limit with setTitleMaxLength(int). Text longer than the limit is truncated and followed by three dots in square brackets ([...]). To remove the limit, use -1 (or the constant UNLIMITED_TITLE_LENGTH).
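
The truncation rule can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the class `TruncateSketch` is hypothetical, and whether the real tagger counts the "[...]" marker toward the limit or inserts a space before it is not stated above. The sketch keeps the first maxLength characters and appends the marker directly.

```java
// Hypothetical illustration of the title length limit; not the library source.
class TruncateSketch {

    // Mirrors the documented meaning of UNLIMITED_TITLE_LENGTH (-1).
    static final int UNLIMITED = -1;

    /** Truncates the title to maxLength characters and marks the cut with "[...]". */
    static String truncate(String title, int maxLength) {
        if (maxLength == UNLIMITED || title.length() <= maxLength) {
            return title; // no limit, or already short enough
        }
        // Assumption: the marker is appended after the kept characters.
        return title.substring(0, maxLength) + "[...]";
    }

    public static void main(String[] args) {
        System.out.println(truncate("A fairly long generated title", 8));
        System.out.println(truncate("A fairly long generated title", UNLIMITED));
    }
}
```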

This class should be used as a post-parsing handler only (or otherwise on unformatted text).

The algorithm to detect titles is quite basic. It uses a generic statistics-based approach to weight each sentence up to a certain amount, and simply returns the sentence with the highest weight, provided a minimum threshold has been met. You are strongly encouraged to use a more sophisticated summarization engine if you want more accurate titles.
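
For readers unfamiliar with this kind of approach, the following sketch shows one generic way to weight sentences statistically. It is not the algorithm used by this class, whose exact weighting is not documented here. The class `TitleScoreSketch`, the term-frequency scoring, the crude length-based word filter, and the `minWeight` parameter are all assumptions chosen for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical term-frequency scoring; not the TitleGeneratorTagger algorithm.
class TitleScoreSketch {

    /** Returns the best-weighted sentence, or the first sentence if none reaches minWeight. */
    static String pickTitle(String text, double minWeight) {
        List<String> sentences = splitSentences(text);
        if (sentences.isEmpty()) {
            return null;
        }
        // Count how often each term occurs in the whole text.
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences) {
            for (String term : terms(s)) {
                freq.merge(term, 1, Integer::sum);
            }
        }
        String best = null;
        double bestWeight = 0;
        for (String s : sentences) {
            List<String> terms = terms(s);
            if (terms.isEmpty()) {
                continue;
            }
            double sum = 0;
            for (String term : terms) {
                sum += freq.get(term);
            }
            double weight = sum / terms.size(); // average term frequency
            if (weight >= minWeight && (best == null || weight > bestWeight)) {
                best = s;
                bestWeight = weight;
            }
        }
        // Fall back to the first sentence when no sentence met the threshold.
        return best != null ? best : sentences.get(0);
    }

    /** Lower-cased words longer than three characters (a crude stop-word filter). */
    static List<String> terms(String sentence) {
        List<String> out = new ArrayList<>();
        for (String t : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (t.length() > 3) {
                out.add(t);
            }
        }
        return out;
    }

    static List<String> splitSentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }
}
```

Averaging the term frequencies keeps long sentences from winning on length alone. A real summarization engine would typically add stop-word lists, stemming, and position-based weighting.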

Max read size

This tagger will only analyze up to the first 10,000 characters. You can change this maximum with AbstractStringTagger.setMaxReadSize(int). Because this class is not optimized for large content analysis, setting a very large maximum number of characters could cause serious performance issues on large files.

XML configuration usage:
```
<handler
    class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger"
    maxReadSize="(max characters to read at once)"
    sourceCharset="(character encoding)"
    fromField="(field of text to use/default uses document content)"
    toField="(target field where to store generated title)"
    onSet="[append|prepend|replace|optional]"
    titleMaxLength="(max num of chars for generated title)"
    detectHeading="[false|true]"
    detectHeadingMinLength="(min length a heading title can have)"
    detectHeadingMaxLength="(max length a heading title can have)">
  
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
</handler>
```
XML usage example:
```
<handler
    class="TitleGeneratorTagger"
    toField="title"
    titleMaxLength="200"
    detectHeading="true"/>
```
The above checks whether the first line looks like a title and, if not, stores the first sentence, up to 200 characters, in a field called title.

Since:

2.1.0

Author:

Pascal Essiembre

Field Summary

Fields
Modifier and Type	Field and Description
`static int`	`DEFAULT_HEADING_MAX_LENGTH`
`static int`	`DEFAULT_HEADING_MIN_LENGTH`
`static int`	`DEFAULT_MAX_READ_SIZE`
`static int`	`DEFAULT_TITLE_MAX_LENGTH`
`static String`	`DEFAULT_TO_FIELD`
`static int`	`UNLIMITED_TITLE_LENGTH`

Constructor Summary

Constructors
Constructor and Description
`TitleGeneratorTagger()`

Method Summary

Modifier and Type	Method and Description
`boolean`	`equals(Object other)`
`int`	`getDetectHeadingMaxLength()`
`int`	`getDetectHeadingMinLength()`
`String`	`getFromField()`
`PropertySetter`	`getOnSet()` Gets the property setter to use when a value is set.
`int`	`getTitleMaxLength()`
`String`	`getToField()`
`int`	`hashCode()`
`boolean`	`isDetectHeading()`
`boolean`	`isOverwrite()` Deprecated. Since 3.0.0 use `getOnSet()`.
`protected void`	`loadStringTaggerFromXML(XML xml)` Loads configuration settings specific to the implementing class.
`protected void`	`saveStringTaggerToXML(XML xml)` Saves configuration settings specific to the implementing class.
`void`	`setDetectHeading(boolean detectHeading)`
`void`	`setDetectHeadingMaxLength(int detectHeadingMaxLength)`
`void`	`setDetectHeadingMinLength(int detectHeadingMinLength)`
`void`	`setFromField(String fromField)`
`void`	`setOnSet(PropertySetter onSet)` Sets the property setter to use when a value is set.
`void`	`setOverwrite(boolean overwrite)` Deprecated. Since 3.0.0 use `setOnSet(PropertySetter)`.
`void`	`setTitleMaxLength(int titleMaxLength)`
`void`	`setToField(String toField)`
`protected void`	`tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex)`
`String`	`toString()`

Methods inherited from class com.norconex.importer.handler.tagger.AbstractStringTagger
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument

Methods inherited from class com.norconex.importer.handler.tagger.AbstractCharStreamTagger
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument

Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument

Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML

- Field Detail
  - DEFAULT_TO_FIELD
```
public static final String DEFAULT_TO_FIELD
```
    See Also:
    
    Constant Field Values
  - DEFAULT_TITLE_MAX_LENGTH
```
public static final int DEFAULT_TITLE_MAX_LENGTH
```
    See Also:
    
    Constant Field Values
  - UNLIMITED_TITLE_LENGTH
```
public static final int UNLIMITED_TITLE_LENGTH
```
    See Also:
    
    Constant Field Values
  - DEFAULT_HEADING_MIN_LENGTH
```
public static final int DEFAULT_HEADING_MIN_LENGTH
```
    See Also:
    
    Constant Field Values
  - DEFAULT_HEADING_MAX_LENGTH
```
public static final int DEFAULT_HEADING_MAX_LENGTH
```
    See Also:
    
    Constant Field Values
  - DEFAULT_MAX_READ_SIZE
```
public static final int DEFAULT_MAX_READ_SIZE
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - TitleGeneratorTagger
```
public TitleGeneratorTagger()
```
- Method Detail
  - tagStringContent
```
protected void tagStringContent(HandlerDoc doc,
                                StringBuilder content,
                                ParseState parseState,
                                int sectionIndex)
                         throws ImporterHandlerException
```
    Specified by:
    
    tagStringContent in class AbstractStringTagger
    
    Throws:
    
    ImporterHandlerException
  - getToField
```
public String getToField()
```
  - setToField
```
public void setToField(String toField)
```
  - isOverwrite
```
@Deprecated
public boolean isOverwrite()
```
    Deprecated. Since 3.0.0 use getOnSet().
    
    Gets whether existing value for the same field should be overwritten.
    
    Returns:
    
    true if overwriting existing value.
  - setOverwrite
```
@Deprecated
public void setOverwrite(boolean overwrite)
```
    Deprecated. Since 3.0.0 use setOnSet(PropertySetter).
    
    Sets whether existing value for the same field should be overwritten.
    
    Parameters:
    
    overwrite - true if overwriting existing value.
  - getFromField
```
public String getFromField()
```
  - setFromField
```
public void setFromField(String fromField)
```
  - getTitleMaxLength
```
public int getTitleMaxLength()
```
  - setTitleMaxLength
```
public void setTitleMaxLength(int titleMaxLength)
```
  - isDetectHeading
```
public boolean isDetectHeading()
```
  - setDetectHeading
```
public void setDetectHeading(boolean detectHeading)
```
  - getDetectHeadingMinLength
```
public int getDetectHeadingMinLength()
```
  - setDetectHeadingMinLength
```
public void setDetectHeadingMinLength(int detectHeadingMinLength)
```
  - getDetectHeadingMaxLength
```
public int getDetectHeadingMaxLength()
```
  - setDetectHeadingMaxLength
```
public void setDetectHeadingMaxLength(int detectHeadingMaxLength)
```
  - getOnSet
```
public PropertySetter getOnSet()
```
    Gets the property setter to use when a value is set.
    
    Returns:
    
    property setter
    
    Since:
    
    3.0.0
  - setOnSet
```
public void setOnSet(PropertySetter onSet)
```
    Sets the property setter to use when a value is set.
    
    Parameters:
    
    onSet - property setter
    
    Since:
    
    3.0.0
  - loadStringTaggerFromXML
```
protected void loadStringTaggerFromXML(XML xml)
```
    Description copied from class: AbstractStringTagger
    
    Loads configuration settings specific to the implementing class.
    
    Specified by:
    
    loadStringTaggerFromXML in class AbstractStringTagger
    
    Parameters:
    
    xml - xml configuration
  - saveStringTaggerToXML
```
protected void saveStringTaggerToXML(XML xml)
```
    Description copied from class: AbstractStringTagger
    
    Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
    
    Specified by:
    
    saveStringTaggerToXML in class AbstractStringTagger
    
    Parameters:
    
    xml - the XML
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class AbstractStringTagger
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class AbstractStringTagger
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class AbstractStringTagger
