public class LanguageTagger extends AbstractStringTagger implements IXMLConfigurable
Detects a document's language using Apache Tika language detection capabilities.
It adds the detected language to the "document.language" metadata field.
Optionally, it also adds every potential language detected, along with its
probability score, as additional fields following this pattern:
document.language.<rank>.tag
document.language.<rank>.probability
<rank> indicates the match order, based on match probability score (starting at 1).
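The field naming pattern above can be illustrated with a small, self-contained sketch. It does not use the Norconex or Tika APIs: the `RankedLanguageFields` class and its `Match` holder are hypothetical names introduced for this illustration only, and it assumes the best match is written both to "document.language" and to rank 1.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: not part of the Norconex Importer API.
final class RankedLanguageFields {

    /** Hypothetical holder for one detection result. */
    static final class Match {
        final String tag;
        final double probability;

        Match(String tag, double probability) {
            this.tag = tag;
            this.probability = probability;
        }
    }

    /** Builds metadata field names and values from detection results. */
    static Map<String, String> toFields(List<Match> matches) {
        // Sort a copy so the highest probability comes first.
        List<Match> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparingDouble((Match m) -> m.probability).reversed());

        Map<String, String> fields = new LinkedHashMap<>();
        if (!sorted.isEmpty()) {
            // Assumption: the best match is also stored in the main field.
            fields.put("document.language", sorted.get(0).tag);
        }
        int rank = 1; // ranks start at 1
        for (Match m : sorted) {
            String prefix = "document.language." + rank + ".";
            fields.put(prefix + "tag", m.tag);
            fields.put(prefix + "probability", String.valueOf(m.probability));
            rank++;
        }
        return fields;
    }
}
```

With two matches, "en" at 0.2 and "fr" at 0.7, this sketch stores "fr" under document.language.1.tag and "en" under document.language.2.tag.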
This tagger can be used either as a pre-parse handler (on text only) or as a post-parse handler.
For optimal detection, the text must be long enough. The default detection algorithm is optimized for documents with lots of text. This tagger relies on Tika language detection capabilities, and future versions may provide better precision for documents made of short text (e.g., tweets, comments).
If you know which mix of languages is used by your site(s), you can often increase accuracy by limiting the set of languages supported for detection.
Languages are represented as code values. As of 2.6.0, at least the following 70 languages are supported by the Tika version used:
It is possible more will be supported automatically with future Tika upgrades.
If you do not restrict the list of language candidates to detect, the default behavior is to try to match all languages currently supported.
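As a rough illustration of why restricting candidates can help, the sketch below picks the highest-scoring language among an allowed set, treating an empty set as "no restriction". It is a simplification under stated assumptions: the `CandidateRestriction` class and the score map are hypothetical, and this is not how Tika applies the restriction internally.

```java
import java.util.Map;
import java.util.Set;

// Illustrative only: not part of the Norconex Importer or Tika APIs.
final class CandidateRestriction {

    /**
     * Returns the highest-scoring language among the allowed candidates,
     * or null when none qualifies. An empty candidate set means that
     * every language is allowed.
     */
    static String bestAmong(Map<String, Double> scores, Set<String> candidates) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            boolean allowed = candidates.isEmpty() || candidates.contains(e.getKey());
            if (allowed && e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```

For example, if closely related languages such as "no" and "da" score 0.45 and 0.40 on a Danish-only site, restricting the candidates to "da" and "en" makes the sketch return "da" instead of "no".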
<handler
    class="com.norconex.importer.handler.tagger.impl.LanguageTagger"
    keepProbabilities="(false|true)"
    toField="(custom target field to store the language)"
    fallbackLanguage="(default language when detection failed)"
    maxReadSize="(max characters to read at once)"
    sourceCharset="(character encoding)">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <languages>
    (CSV list of language tag candidates. Defaults to the above list.)
  </languages>
</handler>
<handler
    class="LanguageTagger"
    fallbackLanguage="en">
  <languages>en, fr</languages>
</handler>
The above example detects whether pages are English or French, falling back to English if detection fails.
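The behavior described for this example can be sketched in plain Java. This is a minimal illustration rather than the tagger's actual implementation: the `FallbackExample` class and both of its methods are hypothetical, and it assumes a failed detection is represented by a null or blank language tag.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not part of the Norconex Importer API.
final class FallbackExample {

    /** Splits a CSV value such as "en, fr" into trimmed language tags. */
    static List<String> parseLanguages(String csv) {
        List<String> tags = new ArrayList<>();
        for (String part : csv.split(",")) {
            String tag = part.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    /** Returns the detected language, or the fallback when detection failed. */
    static String resolve(String detected, String fallback) {
        // Assumption: null or blank means that no language was detected.
        if (detected == null || detected.trim().isEmpty()) {
            return fallback;
        }
        return detected;
    }
}
```

Here `parseLanguages("en, fr")` yields the two candidates "en" and "fr", and `resolve(null, "en")` returns "en", mirroring the fallbackLanguage attribute.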
Constructor and Description |
---|
LanguageTagger() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
String | getFallbackLanguage() |
List<String> | getLanguages() |
int | hashCode() |
boolean | isKeepProbabilities() |
protected void | loadStringTaggerFromXML(XML xml) Loads configuration settings specific to the implementing class. |
protected void | saveStringTaggerToXML(XML xml) Saves configuration settings specific to the implementing class. |
void | setFallbackLanguage(String fallbackLanguage) Sets the fallback language when none are detected. |
void | setKeepProbabilities(boolean keepProbabilities) Sets whether to keep the match probabilities for each language detected. |
void | setLanguages(List<String> languages) Sets the language candidates for the language detection. |
protected void | tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) |
String | toString() |
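Methods inherited from class AbstractStringTagger: getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
Methods inherited from class AbstractCharStreamTagger: getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
Methods inherited from class AbstractDocumentTagger: tagDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Methods inherited from class Object: clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface IXMLConfigurable: loadFromXML, saveToXML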
protected void tagStringContent(HandlerDoc doc, StringBuilder content, ParseState parseState, int sectionIndex) throws ImporterHandlerException
Specified by: tagStringContent in class AbstractStringTagger
Throws: ImporterHandlerException

public boolean isKeepProbabilities()

public void setKeepProbabilities(boolean keepProbabilities)
Sets whether to keep the match probabilities for each language detected. Default is false.
Parameters: keepProbabilities - true to keep probabilities

public String getFallbackLanguage()

public void setFallbackLanguage(String fallbackLanguage)
Sets the fallback language when none are detected.
Parameters: fallbackLanguage - the default language when none is detected

public void setLanguages(List<String> languages)
Sets the language candidates for the language detection.
Parameters: languages - languages to consider for detection

protected void loadStringTaggerFromXML(XML xml)
Description copied from class: AbstractStringTagger
Loads configuration settings specific to the implementing class.
Specified by: loadStringTaggerFromXML in class AbstractStringTagger
Parameters: xml - XML configuration

protected void saveStringTaggerToXML(XML xml)
Description copied from class: AbstractStringTagger
Saves configuration settings specific to the implementing class.
Specified by: saveStringTaggerToXML in class AbstractStringTagger
Parameters: xml - the XML

public boolean equals(Object other)
Overrides: equals in class AbstractStringTagger

public int hashCode()
Overrides: hashCode in class AbstractStringTagger

public String toString()
Overrides: toString in class AbstractStringTagger
Copyright © 2009–2023 Norconex Inc. All rights reserved.