Class LanguageTagger
- java.lang.Object
- com.norconex.importer.handler.AbstractImporterHandler
- com.norconex.importer.handler.tagger.AbstractDocumentTagger
- com.norconex.importer.handler.tagger.AbstractCharStreamTagger
- com.norconex.importer.handler.tagger.AbstractStringTagger
- com.norconex.importer.handler.tagger.impl.LanguageTagger
- All Implemented Interfaces: IXMLConfigurable, IImporterHandler, IDocumentTagger
public class LanguageTagger extends AbstractStringTagger implements IXMLConfigurable
Detects a document's language using the Apache Tika language detection capability and adds the detected language to the "document.language" metadata field. Optionally, it also adds every potential language detected, along with its probability score, as additional fields following this pattern:
document.language.<rank>.tag
document.language.<rank>.probability
<rank> indicates the match order, based on match probability score (starting at 1).
This tagger can be used as either a pre-parse (on text only) or post-parse handler.
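The field pattern described above can be illustrated with a small standalone sketch. It is not the Norconex implementation: the class `RankedLanguageFields`, its `toFields` method, and the sample scores are hypothetical, and the sketch assumes the candidates arrive already sorted from most to least probable.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Norconex Importer. */
final class RankedLanguageFields {

    /** Default target field named in the class description. */
    static final String FIELD = "document.language";

    /**
     * Builds metadata entries from language tags and probability scores.
     * Assumption: both lists have the same size and are sorted from the
     * most probable match to the least probable one.
     */
    static Map<String, String> toFields(
            List<String> tags, List<Double> probabilities,
            boolean keepProbabilities) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (tags.isEmpty()) {
            return fields; // nothing detected, nothing to add
        }
        // The best match always goes to the main field.
        fields.put(FIELD, tags.get(0));
        if (keepProbabilities) {
            for (int i = 0; i < tags.size(); i++) {
                int rank = i + 1; // ranks start at 1
                fields.put(FIELD + "." + rank + ".tag", tags.get(i));
                fields.put(FIELD + "." + rank + ".probability",
                        String.valueOf(probabilities.get(i)));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample values are made up for the demonstration.
        toFields(Arrays.asList("fr", "en"), Arrays.asList(0.83, 0.17), true)
                .forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With probabilities kept, each candidate contributes one `.tag` and one `.probability` entry in addition to the main field; without them, only the main field is produced.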
Accuracy:
Optimal detection requires sufficiently long text. The default detection algorithm is optimized for documents with lots of text. This tagger relies on Tika language detection capabilities, and future versions may provide better precision for documents made of short text (e.g., tweets, comments).
If you know which mix of languages is used by your site(s), you can increase accuracy in many cases by limiting the set of languages supported for detection.
Supported Languages:
Languages are represented as code values. As of 2.6.0, at least the following 70 languages are supported by the Tika version used:
- af Afrikaans
- an Aragonese
- ar Arabic
- ast Asturian
- be Belarusian
- bg Bulgarian
- bn Bengali
- br Breton
- ca Catalan
- cs Czech
- cy Welsh
- da Danish
- de German
- el Greek
- en English
- es Spanish
- et Estonian
- eu Basque
- fa Persian
- fi Finnish
- fr French
- ga Irish
- gl Galician
- gu Gujarati
- he Hebrew
- hi Hindi
- hr Croatian
- ht Haitian
- hu Hungarian
- id Indonesian
- is Icelandic
- it Italian
- ja Japanese
- km Khmer
- kn Kannada
- ko Korean
- lt Lithuanian
- lv Latvian
- mk Macedonian
- ml Malayalam
- mr Marathi
- ms Malay
- mt Maltese
- ne Nepali
- nl Dutch
- no Norwegian
- oc Occitan
- pa Punjabi
- pl Polish
- pt Portuguese
- ro Romanian
- ru Russian
- sk Slovak
- sl Slovene
- so Somali
- sq Albanian
- sr Serbian
- sv Swedish
- sw Swahili
- ta Tamil
- te Telugu
- th Thai
- tl Tagalog
- tr Turkish
- uk Ukrainian
- ur Urdu
- vi Vietnamese
- yi Yiddish
- zh-cn Simplified Chinese
- zh-tw Traditional Chinese
It is possible more will be supported automatically with future Tika upgrades.
If you do not restrict the list of language candidates to detect, the default behavior is to try to match all languages currently supported.
XML configuration usage:
<handler class="com.norconex.importer.handler.tagger.impl.LanguageTagger"
    keepProbabilities="(false|true)"
    toField="(custom target field to store the language)"
    fallbackLanguage="(default language when detection failed)"
    maxReadSize="(max characters to read at once)"
    sourceCharset="(character encoding)">

  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher>(field-matching expression)</fieldMatcher>
    <valueMatcher>(value-matching expression)</valueMatcher>
  </restrictTo>

  <languages>
    (CSV list of language tag candidates. Defaults to the above list.)
  </languages>
</handler>
XML usage example:
<handler class="LanguageTagger" fallbackLanguage="en">
  <languages>en, fr</languages>
</handler>
The above example detects whether pages are English or French, falling back to English if detection fails.
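The "languages" element takes a CSV list of language tags. As a rough sketch of how such a value could be turned into the list accepted by setLanguages(List<String>), the hypothetical helper below splits on commas. It assumes that whitespace around each tag is insignificant (as the "en, fr" example suggests) and that blank entries are ignored; the actual parsing performed by the tagger may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Norconex Importer. */
final class LanguageCsvSketch {

    /**
     * Splits a CSV string of language tags into a list.
     * Assumptions: surrounding whitespace is trimmed, blank entries are skipped.
     */
    static List<String> parse(String csv) {
        List<String> tags = new ArrayList<>();
        if (csv == null) {
            return tags;
        }
        for (String part : csv.split(",")) {
            String tag = part.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Same value as in the XML usage example.
        System.out.println(parse("en, fr"));
    }
}
```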
- Since: 2.0.0
- Author: Pascal Essiembre
Constructor Summary
- LanguageTagger()
-
Method Summary
- boolean equals(Object other)
- String getFallbackLanguage()
- List<String> getLanguages()
- int hashCode()
- boolean isKeepProbabilities()
- protected void loadStringTaggerFromXML(XML xml): Loads configuration settings specific to the implementing class.
- protected void saveStringTaggerToXML(XML xml): Saves configuration settings specific to the implementing class.
- void setFallbackLanguage(String fallbackLanguage): Sets the fallback language when none is detected.
- void setKeepProbabilities(boolean keepProbabilities): Sets whether to keep the match probabilities for each language detected.
- void setLanguages(List<String> languages): Sets the language candidates for the language detection.
- protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex)
- String toString()
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractStringTagger
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractCharStreamTagger
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
-
Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument
-
Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
-
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML
-
Method Detail
-
tagStringContent
protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
- Specified by: tagStringContent in class AbstractStringTagger
- Throws: ImporterHandlerException
-
isKeepProbabilities
public boolean isKeepProbabilities()
-
setKeepProbabilities
public void setKeepProbabilities(boolean keepProbabilities)
Sets whether to keep the match probabilities for each language detected. Default is false.
- Parameters: keepProbabilities - true to keep probabilities
-
getFallbackLanguage
public String getFallbackLanguage()
-
setFallbackLanguage
public void setFallbackLanguage(String fallbackLanguage)
Sets the fallback language to use when none is detected. By default, incoming documents are not tagged with a language field when no detection occurs.
- Parameters: fallbackLanguage - the default language when none is detected
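The fallback rule described for setFallbackLanguage can be sketched as a small decision function. This is a minimal illustration rather than the tagger's actual code: the class `FallbackLanguageSketch` and its `resolve` method are hypothetical, and treating a null or empty value as "no detection" or "no fallback configured" is an assumption.

```java
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of Norconex Importer. */
final class FallbackLanguageSketch {

    /**
     * Returns the language to store, or an empty Optional when the
     * document should be left without a language field.
     * Assumption: null or empty means "not detected" / "not configured".
     */
    static Optional<String> resolve(String detected, String fallbackLanguage) {
        if (detected != null && !detected.isEmpty()) {
            return Optional.of(detected); // detection succeeded
        }
        if (fallbackLanguage != null && !fallbackLanguage.isEmpty()) {
            return Optional.of(fallbackLanguage); // use the configured fallback
        }
        return Optional.empty(); // default: do not tag the document
    }

    public static void main(String[] args) {
        System.out.println(resolve("fr", "en"));
        System.out.println(resolve(null, "en"));
        System.out.println(resolve(null, null));
    }
}
```

Returning an empty Optional mirrors the documented default, in which a document with no detected language and no fallback receives no language field at all.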
-
setLanguages
public void setLanguages(List<String> languages)
Sets the language candidates for the language detection.
- Parameters: languages - languages to consider for detection
-
loadStringTaggerFromXML
protected void loadStringTaggerFromXML(XML xml)
Description copied from class: AbstractStringTagger
Loads configuration settings specific to the implementing class.
- Specified by: loadStringTaggerFromXML in class AbstractStringTagger
- Parameters: xml - xml configuration
-
saveStringTaggerToXML
protected void saveStringTaggerToXML(XML xml)
Description copied from class: AbstractStringTagger
Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
- Specified by: saveStringTaggerToXML in class AbstractStringTagger
- Parameters: xml - the XML
-
equals
public boolean equals(Object other)
- Overrides: equals in class AbstractStringTagger
-
hashCode
public int hashCode()
- Overrides: hashCode in class AbstractStringTagger
-
toString
public String toString()
- Overrides: toString in class AbstractStringTagger