public class LanguageTagger extends AbstractStringTagger implements IXMLConfigurable
Detects a document's language using Tika's language detection capability.
It adds the detected language to the "document.language" metadata field.
Optionally adds all potential languages detected, along with their probability scores, as additional fields following this pattern:
document.language.<rank>.tag
document.language.<rank>.probability
<rank> indicates the match order, based on match probability score (starting at 1).
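To make the field pattern concrete, the sketch below builds the entries that could result when probabilities are kept. It is an illustration only: the `RankedLanguageFields` class, its `Candidate` type, and the plain `Map` standing in for the importer metadata are hypothetical names that are not part of the Norconex API, and the way probability values are formatted as strings is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these types belong to the Norconex API.
final class RankedLanguageFields {

    // A detected language tag with its match probability.
    static final class Candidate {
        final String tag;
        final double probability;

        Candidate(String tag, double probability) {
            this.tag = tag;
            this.probability = probability;
        }
    }

    // Builds field names following the documented pattern, ranking candidates
    // by descending probability, with ranks starting at 1.
    static Map<String, String> toFields(List<Candidate> detected) {
        List<Candidate> sorted = new ArrayList<>(detected);
        sorted.sort(Comparator.comparingDouble((Candidate c) -> c.probability).reversed());

        Map<String, String> fields = new LinkedHashMap<>();
        if (sorted.isEmpty()) {
            return fields;
        }
        // The best match also goes in the main "document.language" field.
        fields.put("document.language", sorted.get(0).tag);
        int rank = 1;
        for (Candidate c : sorted) {
            fields.put("document.language." + rank + ".tag", c.tag);
            fields.put("document.language." + rank + ".probability",
                    String.valueOf(c.probability));
            rank++;
        }
        return fields;
    }

    public static void main(String[] args) {
        List<Candidate> detected = new ArrayList<>();
        detected.add(new Candidate("fr", 0.25));
        detected.add(new Candidate("en", 0.75));
        toFields(detected).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With the two sample candidates, rank 1 is "en" and rank 2 is "fr", since ranks follow descending probability.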
This tagger can be used either as a pre-parse (on text only) or post-parse handler.
For optimal detection, the text should be reasonably long. The default detection algorithm is optimized for documents with a lot of text. This tagger relies on Tika's language detection capabilities, and future versions may provide better precision for documents made of short text (e.g., tweets, comments, etc.).
If you know which mix of languages is used by your site(s), you can often increase accuracy by limiting the set of languages supported for detection.
Languages are represented as code values. As of 2.6.0, at least the following 70 languages are supported by the Tika version used:
It is possible more will be supported automatically with future Tika upgrades.
If you do not restrict the list of language candidates to detect, the default behavior is to try to match all languages currently supported.
Since 2.6.0, this tagger uses Tika for language detection. As a result, more languages are supported, at the expense of less accuracy with short text.
<tagger class="com.norconex.importer.handler.tagger.impl.LanguageTagger"
    keepProbabilities="(false|true)"
    sourceCharset="(character encoding)"
    maxReadSize="(max characters to read at once)"
    fallbackLanguage="(default language when detection failed)" >
  <restrictTo caseSensitive="[false|true]"
      field="(name of header/metadata field name to match)">
    (regular expression of value to match)
  </restrictTo>
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <languages>
    (CSV list of language tag candidates. Defaults to the above list.)
  </languages>
</tagger>
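The "restrictTo" semantics described in the configuration above (a regular expression tested against a metadata field, where any single matching restriction is enough) could be modeled roughly as follows. This is a simplified sketch, not the handler's actual code: `RestrictionCheck` and `Restriction` are hypothetical names, metadata is reduced to a single-valued `Map`, and it assumes that the expression must match the whole field value and that a handler without restrictions always applies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical model of "restrictTo" matching; not part of the Norconex API.
final class RestrictionCheck {

    static final class Restriction {
        final String field;
        final Pattern pattern;

        Restriction(String field, String regex, boolean caseSensitive) {
            this.field = field;
            // Case-insensitive matching is enabled unless caseSensitive is true.
            this.pattern = Pattern.compile(
                    regex, caseSensitive ? 0 : Pattern.CASE_INSENSITIVE);
        }

        // Assumption: the expression must match the entire field value.
        boolean matches(Map<String, String> metadata) {
            String value = metadata.get(field);
            return value != null && pattern.matcher(value).matches();
        }
    }

    // Only one restriction needs to match for the handler to apply.
    static boolean isApplicable(List<Restriction> restrictions,
                                Map<String, String> metadata) {
        if (restrictions.isEmpty()) {
            return true; // assumption: no restriction means always applicable
        }
        for (Restriction r : restrictions) {
            if (r.matches(metadata)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("document.contentType", "text/HTML"); // illustrative field
        List<Restriction> restrictions = new ArrayList<>();
        restrictions.add(new Restriction("document.contentType", "application/pdf", false));
        restrictions.add(new Restriction("document.contentType", "text/.*", false));
        System.out.println(isApplicable(restrictions, metadata));
    }
}
```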
The following detects whether pages are English or French, falling back to English if detection fails.
<tagger class="com.norconex.importer.handler.tagger.impl.LanguageTagger"
    fallbackLanguage="en" >
  <languages>en, fr</languages>
</tagger>
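The combined effect of a candidate list and a fallback language can be pictured with the plain-Java sketch below. It is a rough model of the behavior described here, not the tagger's implementation: actual detection is delegated to Tika, and `LanguagePicker`, `pick`, and the score map are hypothetical names introduced for illustration. An empty candidate set is assumed to mean "no restriction".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of candidate restriction plus fallback; not Norconex code.
final class LanguagePicker {

    // Returns the highest-scoring language among the allowed candidates,
    // or the fallback when nothing acceptable was detected.
    static String pick(Map<String, Double> scores,
                       Set<String> candidates,
                       String fallback) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            boolean allowed = candidates.isEmpty()
                    || candidates.contains(e.getKey());
            if (allowed && e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best != null ? best : fallback;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("es", 0.6);
        scores.put("fr", 0.3);
        Set<String> candidates = new HashSet<>(Arrays.asList("en", "fr"));
        // "es" is not a candidate, so the best allowed match is used.
        System.out.println(pick(scores, candidates, "en"));
        // Nothing detected at all: the fallback language is returned.
        System.out.println(pick(new HashMap<>(), candidates, "en"));
    }
}
```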
Constructor and Description |
---|
LanguageTagger() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
String | getFallbackLanguage() |
String[] | getLanguages() |
int | hashCode() |
boolean | isKeepProbabilities() |
boolean | isShortText() - Deprecated. Since 2.6.0, no special optimization exists for short text and this method always returns false. |
protected void | loadStringTaggerFromXML(org.apache.commons.configuration.XMLConfiguration xml) - Loads configuration settings specific to the implementing class. |
protected void | saveStringTaggerToXML(EnhancedXMLStreamWriter writer) - Saves configuration settings specific to the implementing class. |
void | setFallbackLanguage(String fallbackLanguage) - Sets the fallback language when none are detected. |
void | setKeepProbabilities(boolean keepProbabilities) - Sets whether to keep the match probabilities for each language detected. |
void | setLanguages(String... languages) - Sets the language candidates for the language detection. |
void | setShortText(boolean shortText) - Deprecated. Since 2.6.0, no special optimization exists for short text and calling this method has no effect. |
protected void | tagStringContent(String reference, StringBuilder content, ImporterMetadata metadata, boolean parsed, int sectionIndex) |
String | toString() |
getMaxReadSize, loadCharStreamTaggerFromXML, saveCharStreamTaggerToXML, setMaxReadSize, tagTextDocument
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
tagDocument
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
loadFromXML, saveToXML
protected void tagStringContent(String reference, StringBuilder content, ImporterMetadata metadata, boolean parsed, int sectionIndex) throws ImporterHandlerException
Specified by: tagStringContent in class AbstractStringTagger
Throws: ImporterHandlerException

@Deprecated public boolean isShortText()
Returns: true to use short text detection

@Deprecated public void setShortText(boolean shortText)
Default is false (optimized for long text).
Parameters: shortText - true to use a detection algorithm optimized for short text

public boolean isKeepProbabilities()

public void setKeepProbabilities(boolean keepProbabilities)
Default is false.
Parameters: keepProbabilities - true to keep probabilities

public String getFallbackLanguage()

public void setFallbackLanguage(String fallbackLanguage)
Parameters: fallbackLanguage - the default language when none is detected

public String[] getLanguages()

public void setLanguages(String... languages)
Parameters: languages - languages to consider for detection

protected void loadStringTaggerFromXML(org.apache.commons.configuration.XMLConfiguration xml) throws IOException
Loads configuration settings specific to the implementing class (description copied from class AbstractStringTagger).
Specified by: loadStringTaggerFromXML in class AbstractStringTagger
Parameters: xml - xml configuration
Throws: IOException - could not load from XML

protected void saveStringTaggerToXML(EnhancedXMLStreamWriter writer) throws XMLStreamException
Saves configuration settings specific to the implementing class (description copied from class AbstractStringTagger).
Specified by: saveStringTaggerToXML in class AbstractStringTagger
Parameters: writer - the xml writer
Throws: XMLStreamException - could not save to XML

public boolean equals(Object other)
Overrides: equals in class AbstractStringTagger

public int hashCode()
Overrides: hashCode in class AbstractStringTagger

public String toString()
Overrides: toString in class AbstractStringTagger
Copyright © 2009–2021 Norconex Inc. All rights reserved.