Class LanguageTagger

  • All Implemented Interfaces:
    IXMLConfigurable, IImporterHandler, IDocumentTagger

    public class LanguageTagger
    extends AbstractStringTagger
    implements IXMLConfigurable

    Detects a document's language using Apache Tika's language detection capability. It adds the detected language to the "document.language" metadata field. Optionally, it also adds every candidate language detected, along with its probability score, as additional fields following this pattern:

     document.language.<rank>.tag
     document.language.<rank>.probability

    <rank> indicates the match order, based on the match probability score (starting at 1).
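
    The naming pattern above can be sketched in plain Java. This is only an illustration of how the field names are composed: the LanguageFieldNames class and its toFields method are hypothetical and not part of the Importer API, and the exact string format the tagger uses for probability values is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of the Importer API. It only illustrates
// the "document.language.<rank>.*" field-naming pattern.
class LanguageFieldNames {

    // Builds metadata fields from candidates (language tag -> probability)
    // that are already sorted by descending probability.
    static Map<String, String> toFields(List<Map.Entry<String, Double>> ranked) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (ranked.isEmpty()) {
            return fields; // nothing detected, nothing to add
        }
        // The best match goes in the main field.
        fields.put("document.language", ranked.get(0).getKey());
        for (int i = 0; i < ranked.size(); i++) {
            int rank = i + 1; // ranks start at 1
            Map.Entry<String, Double> candidate = ranked.get(i);
            fields.put("document.language." + rank + ".tag", candidate.getKey());
            // Assumption: the probability is stored as a plain decimal string.
            fields.put("document.language." + rank + ".probability",
                    Double.toString(candidate.getValue()));
        }
        return fields;
    }

    public static void main(String[] args) {
        toFields(List.of(Map.entry("en", 0.92), Map.entry("fr", 0.07)))
                .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

    With keepProbabilities left at its default (false), only the "document.language" field would be expected.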

    This tagger can be used either as a pre-parse (on text only) or a post-parse handler.

    Accuracy:

    Detection is most accurate on sufficiently long text. The default detection algorithm is optimized for documents with a lot of text. This tagger relies on Tika's language detection capabilities, and future versions may provide better precision for documents made of short text (e.g. tweets, comments, etc.).

    If you know which mix of languages is used by your site(s), you can increase accuracy in many cases by limiting the set of languages considered for detection.

    Supported Languages:

    Languages are represented as code values. As of 2.6.0, at least the following 70 languages are supported by the Tika version used:

    • af Afrikaans
    • an Aragonese
    • ar Arabic
    • ast Asturian
    • be Belarusian
    • bg Bulgarian
    • bn Bengali
    • br Breton
    • ca Catalan
    • cs Czech
    • cy Welsh
    • da Danish
    • de German
    • el Greek
    • en English
    • es Spanish
    • et Estonian
    • eu Basque
    • fa Persian
    • fi Finnish
    • fr French
    • ga Irish
    • gl Galician
    • gu Gujarati
    • he Hebrew
    • hi Hindi
    • hr Croatian
    • ht Haitian
    • hu Hungarian
    • id Indonesian
    • is Icelandic
    • it Italian
    • ja Japanese
    • km Khmer
    • kn Kannada
    • ko Korean
    • lt Lithuanian
    • lv Latvian
    • mk Macedonian
    • ml Malayalam
    • mr Marathi
    • ms Malay
    • mt Maltese
    • ne Nepali
    • nl Dutch
    • no Norwegian
    • oc Occitan
    • pa Punjabi
    • pl Polish
    • pt Portuguese
    • ro Romanian
    • ru Russian
    • sk Slovak
    • sl Slovene
    • so Somali
    • sq Albanian
    • sr Serbian
    • sv Swedish
    • sw Swahili
    • ta Tamil
    • te Telugu
    • th Thai
    • tl Tagalog
    • tr Turkish
    • uk Ukrainian
    • ur Urdu
    • vi Vietnamese
    • yi Yiddish
    • zh-cn Simplified Chinese
    • zh-tw Traditional Chinese

    More languages may be supported automatically with future Tika upgrades.

    If you do not restrict the list of language candidates to detect, the default behavior is to try to match all languages currently supported.

    XML configuration usage:

    
    <handler
        class="com.norconex.importer.handler.tagger.impl.LanguageTagger"
        keepProbabilities="(false|true)"
        toField="(custom target field to store the language)"
        fallbackLanguage="(default language when detection failed)"
        maxReadSize="(max characters to read at once)"
        sourceCharset="(character encoding)">
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
      <restrictTo>
        <fieldMatcher>(field-matching expression)</fieldMatcher>
        <valueMatcher>(value-matching expression)</valueMatcher>
      </restrictTo>
      <languages>
        (CSV list of language tag candidates. Defaults to the above list.)
      </languages>
    </handler>

    XML usage example:

    
    <handler
        class="LanguageTagger"
        fallbackLanguage="en">
      <languages>en, fr</languages>
    </handler>

    The above example detects whether pages are English or French, falling back to English if detection fails.
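
    The fallback behavior can be sketched as follows. This is a simplified illustration under stated assumptions, not the tagger's actual implementation: the FallbackSketch class and its resolve method are hypothetical, and a null argument stands in for "nothing detected" or "no fallback configured".

```java
import java.util.Optional;

// Hypothetical illustration, NOT the tagger's actual code.
class FallbackSketch {

    // Returns the language to store, or an empty Optional when the
    // document should not be tagged with a language field at all.
    static Optional<String> resolve(String detected, String fallbackLanguage) {
        if (detected != null) {
            return Optional.of(detected); // detection succeeded
        }
        // No detection: use the fallback if one is configured,
        // otherwise no language field is added.
        return Optional.ofNullable(fallbackLanguage);
    }

    public static void main(String[] args) {
        System.out.println(resolve("fr", "en"));
        System.out.println(resolve(null, "en"));
        System.out.println(resolve(null, null));
    }
}
```

    Returning an empty result when no fallback is set mirrors the documented default for setFallbackLanguage, where documents are left untagged when no detection occurs.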

    Since:
    2.0.0
    Author:
    Pascal Essiembre
    • Constructor Detail

      • LanguageTagger

        public LanguageTagger()
    • Method Detail

      • isKeepProbabilities

        public boolean isKeepProbabilities()
      • setKeepProbabilities

        public void setKeepProbabilities(boolean keepProbabilities)
        Sets whether to keep the match probabilities for each language detected. Default is false.
        Parameters:
        keepProbabilities - true to keep probabilities
      • getFallbackLanguage

        public String getFallbackLanguage()
      • setFallbackLanguage

        public void setFallbackLanguage(String fallbackLanguage)
        Sets the fallback language to use when none is detected. Default behavior is to not tag incoming documents with a language field when no detection occurs.
        Parameters:
        fallbackLanguage - the default language when none is detected
      • getLanguages

        public List<String> getLanguages()
      • setLanguages

        public void setLanguages(List<String> languages)
        Sets the language candidates for the language detection.
        Parameters:
        languages - languages to consider for detection
      • saveStringTaggerToXML

        protected void saveStringTaggerToXML(XML xml)
        Description copied from class: AbstractStringTagger
        Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
        Specified by:
        saveStringTaggerToXML in class AbstractStringTagger
        Parameters:
        xml - the XML