Class TitleGeneratorTagger

  • All Implemented Interfaces:
    IXMLConfigurable, IImporterHandler, IDocumentTagger

    public class TitleGeneratorTagger
    extends AbstractStringTagger
    implements IXMLConfigurable

    Attempts to generate a title from the document content (default) or a specified metadata field. It considers neither the document format nor its structure, and it does not weight some terms more than others. For instance, it does not give text found in <H1> tags more weight than other text in HTML documents.

    If isDetectHeading() returns true, this handler will check whether the content starts with a stand-alone, single-sentence line (which is assumed to be the actual title). That is, a line of text with only one sentence in it, followed by one or more newline characters. To help eliminate cases where such a sentence is inappropriate, you can specify the minimum and maximum number of characters that first line should have with setDetectHeadingMinLength(int) and setDetectHeadingMaxLength(int) (e.g. to ignore "Page 1" text and the like).
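    The heading check described above can be sketched roughly as follows. This is a self-contained illustration, not the class's actual implementation: the class name HeadingDetectionSketch, the method detectHeading and the punctuation-based single-sentence test are all assumptions made for the example.

```java
import java.util.Optional;

// Hypothetical sketch of the heading check; not the Norconex implementation.
final class HeadingDetectionSketch {

    // Returns the first line when it looks like a stand-alone, single-sentence
    // heading whose length falls within [minLength, maxLength].
    static Optional<String> detectHeading(String text, int minLength, int maxLength) {
        if (text == null) {
            return Optional.empty();
        }
        String trimmed = text.replaceFirst("^\\s+", "");
        int eol = indexOfLineBreak(trimmed);
        if (eol < 0) {
            // No line break after the first line: it is not a stand-alone line.
            return Optional.empty();
        }
        String firstLine = trimmed.substring(0, eol).trim();
        if (firstLine.length() < minLength || firstLine.length() > maxLength) {
            return Optional.empty();
        }
        // Assumption: sentence-ending punctuation followed by more text
        // means the line holds more than one sentence.
        if (firstLine.matches(".*[.!?]\\s+\\S.*")) {
            return Optional.empty();
        }
        return Optional.of(firstLine);
    }

    // Index of the first '\n' or '\r', or -1 if there is none.
    private static int indexOfLineBreak(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(detectHeading("Annual Report 2020\n\nBody text.", 10, 100));
        System.out.println(detectHeading("Page 1\nBody text.", 10, 100));
    }
}
```

    With a minimum length of 10, a short first line such as "Page 1" is rejected, which is the kind of case the minimum and maximum length settings are meant to filter out.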

    Unless a target field name is provided, the default field name where the title will be stored is document.generatedTitle.

    Storing values in an existing field

    If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.
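    The difference between the default append behavior and an alternative such as replacing existing values can be illustrated with a plain map of value lists. The SetMode enum and the store method below are hypothetical stand-ins for this example only; they are not the PropertySetter API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of append vs. replace; not the PropertySetter API.
final class FieldStoreSketch {

    enum SetMode { APPEND, REPLACE }

    // Stores a value under a field, either after the existing values
    // (APPEND) or in place of them (REPLACE).
    static void store(Map<String, List<String>> metadata,
            String field, String value, SetMode mode) {
        List<String> values = metadata.computeIfAbsent(field, k -> new ArrayList<>());
        if (mode == SetMode.REPLACE) {
            values.clear();
        }
        values.add(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        store(metadata, "title", "Existing title", SetMode.APPEND);
        store(metadata, "title", "Generated title", SetMode.APPEND);
        System.out.println(metadata.get("title"));
    }
}
```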

    If it cannot generate a title, it will fall back to retrieving the first sentence from the text.

    The generated title length is limited to 150 characters by default. You can change that limit with setTitleMaxLength(int). Text longer than the limit is truncated and three dots in square brackets ([...]) are appended. To remove the limit, use -1 (or the constant UNLIMITED_TITLE_LENGTH).
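    A minimal sketch of that truncation rule follows. It assumes the [...] marker is appended after the cut, so the result can exceed the limit by the length of the marker; whether the real class counts the marker toward the limit is not stated here. The class and method names are hypothetical.

```java
// Hypothetical sketch of title truncation; not the Norconex implementation.
final class TitleTruncationSketch {

    static final int UNLIMITED_TITLE_LENGTH = -1;
    static final int DEFAULT_TITLE_MAX_LENGTH = 150;

    // Cuts the title at maxLength characters and appends "[...]".
    // Assumption: the marker is not counted toward maxLength.
    static String truncate(String title, int maxLength) {
        if (maxLength < 0 || title.length() <= maxLength) {
            return title;
        }
        return title.substring(0, maxLength) + "[...]";
    }

    public static void main(String[] args) {
        System.out.println(truncate("A fairly long generated title", 8));
        System.out.println(truncate("Short", DEFAULT_TITLE_MAX_LENGTH));
        System.out.println(truncate("Never truncated", UNLIMITED_TITLE_LENGTH));
    }
}
```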

    This class should be used as a post-parsing handler only (or otherwise on unformatted text).

    The algorithm to detect titles is quite basic. It uses a generic statistics-based approach to weight each sentence up to a certain number of sentences, and simply returns the sentence with the highest weight, provided a minimum threshold has been met. You are strongly encouraged to use a more sophisticated summarization engine if you want more accurate titles.
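    To give a feel for this kind of approach, the sketch below weights each sentence by the average document-wide frequency of its words and returns the best one if it reaches a threshold. The term-frequency scoring is an assumption chosen for illustration; the statistics actually used by this class are not described here, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical term-frequency scoring; not the Norconex implementation.
final class SentenceWeightSketch {

    // Weights up to maxSentences sentences and returns the heaviest one,
    // or empty when no sentence reaches minWeight.
    static Optional<String> bestSentence(String text, int maxSentences, double minWeight) {
        String[] sentences = text.split("(?<=[.!?])\\s+");
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words(text)) {
            freq.merge(w, 1, Integer::sum);
        }
        String best = null;
        double bestWeight = 0;
        int limit = Math.min(sentences.length, maxSentences);
        for (int i = 0; i < limit; i++) {
            List<String> ws = words(sentences[i]);
            if (ws.isEmpty()) {
                continue;
            }
            double sum = 0;
            for (String w : ws) {
                sum += freq.getOrDefault(w, 0);
            }
            // Average word frequency, so long sentences are not favored.
            double weight = sum / ws.size();
            if (weight > bestWeight) {
                best = sentences[i];
                bestWeight = weight;
            }
        }
        return bestWeight >= minWeight ? Optional.ofNullable(best) : Optional.empty();
    }

    // Lower-cased alphanumeric tokens of a string.
    static List<String> words(String s) {
        List<String> out = new ArrayList<>();
        for (String w : s.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Cats sleep. Cats chase mice. Dogs bark.";
        System.out.println(bestSentence(text, 10, 1.2));
    }
}
```

    When no sentence reaches the threshold, the method returns an empty result, which is the point where a caller would fall back to the first sentence.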

    Max read size

    This tagger will only analyze up to the first 10,000 characters. You can change this maximum with AbstractStringTagger.setMaxReadSize(int). Given that this class is not optimized for large content analysis, setting a very high maximum could cause serious performance issues on large files.

    XML configuration usage:

    
    <handler
        class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger"
        maxReadSize="(max characters to read at once)"
        sourceCharset="(character encoding)"
        fromField="(field of text to use/default uses document content)"
        toField="(target field where to store generated title)"
        titleMaxLength="(max num of chars for generated title)"
        detectHeading="[false|true]"
        detectHeadingMinLength="(min length a heading title can have)"
        detectHeadingMaxLength="(max length a heading title can have)">
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
      <restrictTo>
        <fieldMatcher>(field-matching expression)</fieldMatcher>
        <valueMatcher>(value-matching expression)</valueMatcher>
      </restrictTo>
    </handler>

    XML usage example:

    
    <handler
        class="TitleGeneratorTagger"
        toField="title"
        titleMaxLength="200"
        detectHeading="true"/>

    The above will check whether the first line looks like a title and, if not, it will store the first sentence, up to 200 characters, in a field called title.

    Since:
    2.1.0
    Author:
    Pascal Essiembre
    • Constructor Detail

      • TitleGeneratorTagger

        public TitleGeneratorTagger()
    • Method Detail

      • getToField

        public String getToField()
      • setToField

        public void setToField(String toField)
      • isOverwrite

        @Deprecated
        public boolean isOverwrite()
        Deprecated.
        Since 3.0.0 use getOnSet().
        Gets whether an existing value for the same field should be overwritten.
        Returns:
        true if overwriting an existing value.
      • setOverwrite

        @Deprecated
        public void setOverwrite(boolean overwrite)
        Deprecated.
        Since 3.0.0 use setOnSet(PropertySetter).
        Sets whether an existing value for the same field should be overwritten.
        Parameters:
        overwrite - true if overwriting an existing value.
      • getFromField

        public String getFromField()
      • setFromField

        public void setFromField(String fromField)
      • getTitleMaxLength

        public int getTitleMaxLength()
      • setTitleMaxLength

        public void setTitleMaxLength(int titleMaxLength)
      • isDetectHeading

        public boolean isDetectHeading()
      • setDetectHeading

        public void setDetectHeading(boolean detectHeading)
      • getDetectHeadingMinLength

        public int getDetectHeadingMinLength()
      • setDetectHeadingMinLength

        public void setDetectHeadingMinLength(int detectHeadingMinLength)
      • getDetectHeadingMaxLength

        public int getDetectHeadingMaxLength()
      • setDetectHeadingMaxLength

        public void setDetectHeadingMaxLength(int detectHeadingMaxLength)
      • getOnSet

        public PropertySetter getOnSet()
        Gets the property setter to use when a value is set.
        Returns:
        property setter
        Since:
        3.0.0
      • setOnSet

        public void setOnSet(PropertySetter onSet)
        Sets the property setter to use when a value is set.
        Parameters:
        onSet - property setter
        Since:
        3.0.0
      • saveStringTaggerToXML

        protected void saveStringTaggerToXML(XML xml)
        Description copied from class: AbstractStringTagger
        Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
        Specified by:
        saveStringTaggerToXML in class AbstractStringTagger
        Parameters:
        xml - the XML