public class URLExtractorTagger extends AbstractCharStreamTagger implements IXMLConfigurable
Extracts unique URLs matching specific patterns in plain text content and stores them in a given field.
The URL-matching patterns used are relatively simple. The tagger looks for strings
starting with "http://", "https://", or "www.". The latter is prefixed with "https://"
when encountered (to make it absolute).
The matching is case-insensitive. If you need alternate ways to detect URLs,
you can use a combination of RegexTagger and ReplaceTagger, or
create your own implementation.
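The matching rules described above can be illustrated with a minimal, self-contained sketch. It does not reproduce the tagger's actual pattern: the class name UrlMatchSketch, the extract method, and the regular expression are all assumptions made for illustration. Only the behavior (case-insensitive prefixes, "https://" added to "www." matches, unique results) comes from the description.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlMatchSketch {
    // Hypothetical pattern (not the tagger's own): a run of characters that
    // starts with http://, https:// or www. and stops at whitespace, quotes
    // or angle brackets. Trailing punctuation is not trimmed in this sketch.
    private static final Pattern URL = Pattern.compile(
            "\\b(?:https?://|www\\.)[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    static Set<String> extract(String text) {
        // LinkedHashSet keeps first-seen order and drops duplicates.
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            String url = m.group();
            // Make "www." matches absolute, ignoring case on the prefix.
            if (url.regionMatches(true, 0, "www.", 0, 4)) {
                url = "https://" + url;
            }
            urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "See www.example.com and HTTP://example.org/a or www.example.com again"));
    }
}
```

In this sketch the second occurrence of www.example.com is dropped because the normalized value is already in the set.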
If a target field with the same name already exists for a document,
values will be added to the end of the existing value list.
It is possible to change this default behavior by supplying a PropertySetter.
If no URLs are found, the target field values (if any) are left intact.
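The effect of the onSet options on an existing target field can be pictured with plain Java lists. This is only a sketch: the Mode enum and the apply method are hypothetical stand-ins, not the library's PropertySetter, and the reading of "optional" (set only when the field has no value yet) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class OnSetSketch {
    // Hypothetical stand-in for the four onSet options.
    enum Mode { APPEND, PREPEND, REPLACE, OPTIONAL }

    static List<String> apply(Mode mode, List<String> existing, List<String> found) {
        if (found.isEmpty()) {
            // No URLs found: existing values are left intact.
            return new ArrayList<>(existing);
        }
        List<String> result = new ArrayList<>();
        switch (mode) {
            case APPEND:   // default: new values go after the existing ones
                result.addAll(existing);
                result.addAll(found);
                break;
            case PREPEND:  // new values go before the existing ones
                result.addAll(found);
                result.addAll(existing);
                break;
            case REPLACE:  // existing values are discarded
                result.addAll(found);
                break;
            case OPTIONAL: // assumed: only set when nothing is there yet
                result.addAll(existing.isEmpty() ? found : existing);
                break;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> existing = Arrays.asList("https://old.example.com");
        List<String> found = Arrays.asList("https://new.example.com");
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + apply(mode, existing, found));
        }
        System.out.println("nothing found -> "
                + apply(Mode.REPLACE, existing, Collections.emptyList()));
    }
}
```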
It is possible to specify a fromField
as the source of the text to use instead of using the document content.
This class is typically used as a post-parsing handler only (to ensure we are dealing with text).
<handler
class="com.norconex.importer.handler.tagger.impl.URLExtractorTagger"
toField="(target field where to store extracted URLs)"
maxReadSize="(max characters to read at once)"
sourceCharset="(character encoding)"
onSet="[append|prepend|replace|optional]">
<!-- multiple "restrictTo" tags allowed (only one needs to match) -->
<restrictTo>
<fieldMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]">
(field-matching expression)
</fieldMatcher>
<valueMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]">
(value-matching expression)
</valueMatcher>
</restrictTo>
<fieldMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]">
(Optional field of text to use. Default uses document content.)
</fieldMatcher>
</handler>
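The maxReadSize attribute in the configuration above limits how many characters are read at once. As a rough illustration of what reading in bounded chunks can look like, here is a minimal sketch. It is not the library's TextReader: the class ChunkedReadSketch, the readChunks method, and the choice to cut chunks at whitespace (so a URL is not split in two) are assumptions made for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class ChunkedReadSketch {
    // Splits the reader's text into chunks of at most maxReadSize characters
    // (maxReadSize must be > 0). Each full chunk is cut just after the last
    // whitespace it contains, so a token such as a URL stays whole when it
    // fits; a token longer than maxReadSize is hard-cut.
    static List<String> readChunks(Reader reader, int maxReadSize) throws IOException {
        List<String> chunks = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        char[] buf = new char[maxReadSize];
        int n;
        while ((n = reader.read(buf)) != -1) {
            pending.append(buf, 0, n);
            while (pending.length() >= maxReadSize) {
                int cut = cutPoint(pending, maxReadSize);
                chunks.add(pending.substring(0, cut));
                pending.delete(0, cut);
            }
        }
        if (pending.length() > 0) {
            chunks.add(pending.toString()); // remaining text after end of input
        }
        return chunks;
    }

    // Index just after the last whitespace within the first 'limit' characters,
    // or 'limit' itself when there is no whitespace to cut at.
    private static int cutPoint(CharSequence text, int limit) {
        for (int i = limit - 1; i >= 0; i--) {
            if (Character.isWhitespace(text.charAt(i))) {
                return i + 1;
            }
        }
        return limit;
    }

    public static void main(String[] args) throws IOException {
        String text = "aa bb http://x.y/z cc";
        for (String chunk : readChunks(new StringReader(text), 16)) {
            System.out.println("[" + chunk + "]");
        }
    }
}
```

Each chunk could then be scanned for URLs independently, which keeps memory use bounded for large documents.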
<handler
class="URLExtractorTagger"
toField="documentURLs">
<restrictTo>
<fieldMatcher>document.contentType</fieldMatcher>
<valueMatcher>application/pdf</valueMatcher>
</restrictTo>
</handler>
The above example is used as a post-parse handler. It detects URLs in parsed PDFs and stores those URLs in a field called "documentURLs".
Constructor and Description |
---|
URLExtractorTagger() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
TextMatcher | getFieldMatcher() Gets field matcher for fields containing text. |
int | getMaxReadSize() Gets the maximum number of characters to read from content for tagging at once. |
PropertySetter | getOnSet() Gets the property setter to use when a value is set. |
String | getToField() |
int | hashCode() |
protected void | loadCharStreamTaggerFromXML(XML xml) Loads configuration settings specific to the implementing class. |
protected void | saveCharStreamTaggerToXML(XML xml) Saves configuration settings specific to the implementing class. |
void | setFieldMatcher(TextMatcher fieldMatcher) Sets the field matcher for fields containing text. |
void | setMaxReadSize(int maxReadSize) Sets the maximum number of characters to read from content for tagging at once. |
void | setOnSet(PropertySetter onSet) Sets the property setter to use when a value is set. |
void | setToField(String toField) |
protected void | tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState) |
String | toString() |
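Methods inherited from class AbstractCharStreamTagger: getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument
Methods inherited from class AbstractDocumentTagger: tagDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Methods inherited from class java.lang.Object: clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface IXMLConfigurable: loadFromXML, saveToXML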
protected void tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState) throws ImporterHandlerException
Specified by: tagTextDocument in class AbstractCharStreamTagger
Throws: ImporterHandlerException

public String getToField()

public void setToField(String toField)

public TextMatcher getFieldMatcher()

public void setFieldMatcher(TextMatcher fieldMatcher)
Parameters: fieldMatcher - field matcher

public PropertySetter getOnSet()

public void setOnSet(PropertySetter onSet)
Parameters: onSet - property setter

public int getMaxReadSize()
Default is TextReader.DEFAULT_MAX_READ_SIZE.

public void setMaxReadSize(int maxReadSize)
Parameters: maxReadSize - maximum read size

protected void loadCharStreamTaggerFromXML(XML xml)
Description copied from class: AbstractCharStreamTagger
Specified by: loadCharStreamTaggerFromXML in class AbstractCharStreamTagger
Parameters: xml - xml configuration

protected void saveCharStreamTaggerToXML(XML xml)
Description copied from class: AbstractCharStreamTagger
Specified by: saveCharStreamTaggerToXML in class AbstractCharStreamTagger
Parameters: xml - the XML

public boolean equals(Object other)
Overrides: equals in class AbstractCharStreamTagger

public int hashCode()
Overrides: hashCode in class AbstractCharStreamTagger

public String toString()
Overrides: toString in class AbstractCharStreamTagger
Copyright © 2009–2023 Norconex Inc. All rights reserved.