- java.lang.Object
  - com.norconex.importer.handler.AbstractImporterHandler
    - com.norconex.importer.handler.tagger.AbstractDocumentTagger
      - com.norconex.importer.handler.tagger.AbstractCharStreamTagger
        - com.norconex.importer.handler.tagger.AbstractStringTagger
          - com.norconex.importer.handler.tagger.impl.LanguageTagger

All Implemented Interfaces:

IXMLConfigurable, IImporterHandler, IDocumentTagger
```
public class LanguageTagger
extends AbstractStringTagger
implements IXMLConfigurable
```
Detects a document's language using Apache Tika's language detection capability and adds the detected language to the "document.language" metadata field. Optionally, it also adds every candidate language detected, along with its probability score, in additional fields following this pattern:
```
 document.language.<rank>.tag
 document.language.<rank>.probability
```
<rank> indicates the match order, based on the match probability score (starting at 1).
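
The field naming pattern above can be illustrated with a small, standalone sketch. It is not the LanguageTagger implementation: the class `LanguageFieldSketch`, its `toFields` method, the assumption that rank 1 is the most probable language, and the use of `Double.toString` for the probability value are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical helper, not part of Norconex Importer.
class LanguageFieldSketch {

    /** Builds metadata fields from language scores (language tag -> probability). */
    static Map<String, String> toFields(
            Map<String, Double> scores, boolean keepProbabilities) {
        // Assumption: rank 1 is the most probable candidate.
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        Map<String, String> fields = new LinkedHashMap<>();
        if (ranked.isEmpty()) {
            return fields; // nothing detected, nothing to tag
        }
        fields.put("document.language", ranked.get(0).getKey());
        if (keepProbabilities) {
            int rank = 1;
            for (Map.Entry<String, Double> e : ranked) {
                fields.put("document.language." + rank + ".tag", e.getKey());
                // Assumption: the probability is stored as a plain decimal string.
                fields.put("document.language." + rank + ".probability",
                        Double.toString(e.getValue()));
                rank++;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("en", 0.25);
        scores.put("fr", 0.75);
        toFields(scores, true).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With `keepProbabilities` set to false, the sketch produces only the "document.language" field.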

This tagger can be used as either a pre-parse handler (on text only) or a post-parse handler.

Accuracy:

For optimal detection, the text must be long enough. The default detection algorithm is optimized for documents with lots of text. This tagger relies on Tika language detection capabilities, and future versions may provide better precision for documents made of short text (e.g., tweets, comments, etc.).

If you know which mix of languages is used by your site(s), you can often increase accuracy by limiting the set of languages considered for detection.
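
To see why limiting the candidate set can help, consider the simplified sketch below. It is only an illustration under stated assumptions: the class `CandidateRestrictionSketch`, its `best` method, and the scores are invented, and the sketch filters scores after the fact, whereas an actual detector would more likely restrict the languages before scoring.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Illustrative only: a hypothetical helper, not part of Norconex Importer or Tika.
class CandidateRestrictionSketch {

    /**
     * Returns the highest-scoring language among the allowed candidates.
     * An empty "allowed" set means no restriction (all languages compete).
     */
    static Optional<String> best(Map<String, Double> scores, Set<String> allowed) {
        return scores.entrySet().stream()
                .filter(e -> allowed.isEmpty() || allowed.contains(e.getKey()))
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        // Made-up scores for a short Spanish text that resembles Galician.
        Map<String, Double> scores = Map.of("gl", 0.40, "es", 0.35, "pt", 0.25);
        System.out.println(best(scores, Set.of()));           // unrestricted
        System.out.println(best(scores, Set.of("es", "en"))); // restricted
    }
}
```

In this made-up case, a closely related language wins when every language competes, while restricting the candidates to the languages known to be on the site removes that false match.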

Supported Languages:

Languages are represented as code values. As of 2.6.0, at least the following 70 languages are supported by the Tika version used:
- af Afrikaans
- an Aragonese
- ar Arabic
- ast Asturian
- be Belarusian
- br Breton
- ca Catalan
- bg Bulgarian
- bn Bengali
- cs Czech
- cy Welsh
- da Danish
- de German
- el Greek
- en English
- es Spanish
- et Estonian
- eu Basque
- fa Persian
- fi Finnish
- fr French
- ga Irish
- gl Galician
- gu Gujarati
- he Hebrew
- hi Hindi
- hr Croatian
- ht Haitian
- hu Hungarian
- id Indonesian
- is Icelandic
- it Italian
- ja Japanese
- km Khmer
- kn Kannada
- ko Korean
- lt Lithuanian
- lv Latvian
- mk Macedonian
- ml Malayalam
- mr Marathi
- ms Malay
- mt Maltese
- ne Nepali
- nl Dutch
- no Norwegian
- oc Occitan
- pa Punjabi
- pl Polish
- pt Portuguese
- ro Romanian
- ru Russian
- sk Slovak
- sl Slovene
- so Somali
- sq Albanian
- sr Serbian
- sv Swedish
- sw Swahili
- ta Tamil
- te Telugu
- th Thai
- tl Tagalog
- tr Turkish
- uk Ukrainian
- ur Urdu
- vi Vietnamese
- yi Yiddish
- zh-cn Simplified Chinese
- zh-tw Traditional Chinese
It is possible more will be supported automatically with future Tika upgrades.

If you do not restrict the list of language candidates to detect, the default behavior is to try to match all currently supported languages.

XML configuration usage:
```
<handler
    class="com.norconex.importer.handler.tagger.impl.LanguageTagger"
    keepProbabilities="(false|true)"
    toField="(custom target field to store the language)"
    fallbackLanguage="(default language when detection failed)"
    maxReadSize="(max characters to read at once)"
    sourceCharset="(character encoding)">
  
  <restrictTo>
    <fieldMatcher>(field-matching expression)</fieldMatcher>
    <valueMatcher>(value-matching expression)</valueMatcher>
  </restrictTo>
  <languages>
    (CSV list of language tag candidates. Defaults to the above list.)
  </languages>
</handler>
```
XML usage example:
```
<handler
    class="LanguageTagger"
    fallbackLanguage="en">
  <languages>en, fr</languages>
</handler>
```
The above example detects whether pages are English or French, falling back to English if detection fails.
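
The two settings used in this example can be pictured with a short standalone sketch. It assumes the `<languages>` value is split on commas and trimmed, and that the fallback applies only when nothing is detected. The class `FallbackSketch` and its methods are hypothetical and do not reflect the actual LanguageTagger code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Illustrative only: a hypothetical helper, not part of Norconex Importer.
class FallbackSketch {

    /** Splits a CSV value such as "en, fr" into trimmed, non-empty language tags. */
    static List<String> parseLanguages(String csv) {
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    /**
     * Picks the language to tag: the detected one if any, else the fallback.
     * Returns empty when both are null, meaning no language field is added.
     */
    static Optional<String> resolve(String detected, String fallbackLanguage) {
        return Optional.ofNullable(detected != null ? detected : fallbackLanguage);
    }

    public static void main(String[] args) {
        System.out.println(parseLanguages("en, fr"));
        System.out.println(resolve(null, "en"));
        System.out.println(resolve("fr", "en"));
        System.out.println(resolve(null, null));
    }
}
```
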
Since:

2.0.0

Author:

Pascal Essiembre

Constructor Summary

Constructors
Constructor Description

LanguageTagger()

Method Summary

Modifier and Type	Method	Description
`boolean`	`equals(Object other)`
`String`	`getFallbackLanguage()`
`List<String>`	`getLanguages()`
`int`	`hashCode()`
`boolean`	`isKeepProbabilities()`
`protected void`	`loadStringTaggerFromXML(XML xml)`	Loads configuration settings specific to the implementing class.
`protected void`	`saveStringTaggerToXML(XML xml)`	Saves configuration settings specific to the implementing class.
`void`	`setFallbackLanguage(String fallbackLanguage)`	Sets the fallback language to use when none is detected.
`void`	`setKeepProbabilities(boolean keepProbabilities)`	Sets whether to keep the match probabilities for each language detected.
`void`	`setLanguages(List<String> languages)`	Sets the language candidates for the language detection.
`protected void`	`tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex)`
`String`	`toString()`

Methods inherited from class com.norconex.importer.handler.tagger.AbstractStringTagger
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument

Methods inherited from class com.norconex.importer.handler.tagger.AbstractCharStreamTagger
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument

Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument

Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML

- Constructor Detail
  - LanguageTagger
```
public LanguageTagger()
```
- Method Detail
  - tagStringContent
```
protected void tagStringContent(HandlerDoc doc,
                                StringBuilder content,
                                ParseState parseState,
                                int sectionIndex)
                         throws ImporterHandlerException
```
    Specified by:
    
    tagStringContent in class AbstractStringTagger
    
    Throws:
    
    ImporterHandlerException
  - isKeepProbabilities
```
public boolean isKeepProbabilities()
```
  - setKeepProbabilities
```
public void setKeepProbabilities(boolean keepProbabilities)
```
    Sets whether to keep the match probabilities for each language detected. Default is false.
    
    Parameters:
    
    keepProbabilities - true to keep probabilities
  - getFallbackLanguage
```
public String getFallbackLanguage()
```
  - setFallbackLanguage
```
public void setFallbackLanguage(String fallbackLanguage)
```
    Sets the fallback language to use when none is detected. By default, incoming documents are not tagged with a language field when no language is detected.
    
    Parameters:
    
    fallbackLanguage - the default language when none is detected
  - getLanguages
```
public List<String> getLanguages()
```
  - setLanguages
```
public void setLanguages(List<String> languages)
```
    Sets the language candidates for the language detection.
    
    Parameters:
    
    languages - languages to consider for detection
  - loadStringTaggerFromXML
```
protected void loadStringTaggerFromXML(XML xml)
```
    Description copied from class: AbstractStringTagger
    
    Loads configuration settings specific to the implementing class.
    
    Specified by:
    
    loadStringTaggerFromXML in class AbstractStringTagger
    
    Parameters:
    
    xml - xml configuration
  - saveStringTaggerToXML
```
protected void saveStringTaggerToXML(XML xml)
```
    Description copied from class: AbstractStringTagger
    
    Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
    
    Specified by:
    
    saveStringTaggerToXML in class AbstractStringTagger
    
    Parameters:
    
    xml - the XML
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class AbstractStringTagger
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class AbstractStringTagger
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class AbstractStringTagger
