URLExtractorTagger (Norconex Importer 3.0.1 API)

java.lang.Object
- com.norconex.importer.handler.AbstractImporterHandler
  - com.norconex.importer.handler.tagger.AbstractDocumentTagger
    - com.norconex.importer.handler.tagger.AbstractCharStreamTagger
      - com.norconex.importer.handler.tagger.impl.URLExtractorTagger

All Implemented Interfaces: IXMLConfigurable, IImporterHandler, IDocumentTagger

public class URLExtractorTagger
extends AbstractCharStreamTagger
implements IXMLConfigurable

Extracts unique URLs matching specific patterns in plain text content and stores them in a given field.

The URL-matching patterns used are relatively simple. They look for strings starting with http://, https://, or www. (with the trailing dot). The latter is prefixed with https:// when encountered, to make it absolute.

The matching is case-insensitive. If you need alternate ways to detect URLs, you can use a combination of RegexTagger, ReplaceTagger, or create your own implementation.
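The matching rules described above can be illustrated with a short, self-contained sketch. It is not the tagger's actual implementation: the class name UrlExtractSketch, the method extractUrls, and the regular expression are assumptions made for illustration, and the real patterns may treat trailing punctuation and other edge cases differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Norconex Importer.
class UrlExtractSketch {

    // Assumed pattern: "http://", "https://", or "www." followed by
    // non-whitespace characters, matched case-insensitively.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "\\b(?:https?://|www\\.)\\S+", Pattern.CASE_INSENSITIVE);

    // Returns unique URLs in order of first appearance.
    static List<String> extractUrls(String text) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            String url = m.group();
            // Make "www." matches absolute by prefixing "https://".
            if (url.regionMatches(true, 0, "www.", 0, 4)) {
                url = "https://" + url;
            }
            urls.add(url);
        }
        return new ArrayList<>(urls);
    }

    public static void main(String[] args) {
        String text = "Go to www.example.org and HTTPS://Example.com/x "
                + "then www.example.org";
        for (String url : extractUrls(text)) {
            System.out.println(url);
        }
    }
}
```

A LinkedHashSet is used here so that duplicates are dropped while the order of first appearance is kept; whether the tagger itself preserves order is not stated in this documentation.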

Storing values in an existing field

If a target field with the same name already exists for a document, values will be added to the end of the existing value list. It is possible to change this default behavior by supplying a PropertySetter.

If no URLs are found, the target field values (if any) are left intact.
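As a rough illustration of how the onSet options could affect an existing list of values, the sketch below models them on plain Java lists. It does not use the PropertySetter class itself: the OnSetSketch class, its OnSet enum, and the apply method are hypothetical, and the semantics shown for prepend, replace, and optional are assumptions inferred from the option names in the XML configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the onSet options; not the Norconex PropertySetter.
class OnSetSketch {

    enum OnSet { APPEND, PREPEND, REPLACE, OPTIONAL }

    // Returns the values the target field would hold after extraction.
    static List<String> apply(
            OnSet onSet, List<String> existing, List<String> extracted) {
        // No URLs found: existing values are left intact.
        if (extracted.isEmpty()) {
            return new ArrayList<>(existing);
        }
        List<String> result = new ArrayList<>();
        switch (onSet) {
            case PREPEND:   // assumed: new values go before existing ones
                result.addAll(extracted);
                result.addAll(existing);
                break;
            case REPLACE:   // assumed: new values replace existing ones
                result.addAll(extracted);
                break;
            case OPTIONAL:  // assumed: set only when the field is empty
                result.addAll(existing.isEmpty() ? extracted : existing);
                break;
            default:        // APPEND: new values go after existing ones
                result.addAll(existing);
                result.addAll(extracted);
                break;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> existing = Arrays.asList("https://old.example.com");
        List<String> extracted = Arrays.asList("https://new.example.com");
        for (OnSet onSet : OnSet.values()) {
            System.out.println(onSet + " -> " + apply(onSet, existing, extracted));
        }
    }
}
```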

Content source

It is possible to specify a field matcher (fieldMatcher) selecting the field whose text is used as the source, instead of using the document content.

This class is typically used as a post-parsing handler only (to ensure we are dealing with text).

XML configuration usage:


<handler
    class="com.norconex.importer.handler.tagger.impl.URLExtractorTagger"
    toField="(target field where to store extracted URLs)"
    maxReadSize="(max characters to read at once)"
    sourceCharset="(character encoding)"
    onSet="[append|prepend|replace|optional]">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <fieldMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (Optional field of text to use. Default uses document content.)
  </fieldMatcher>
</handler>

XML usage example:


<handler
    class="URLExtractorTagger"
    toField="documentURLs">
  <restrictTo>
    <fieldMatcher>document.contentType</fieldMatcher>
    <valueMatcher>application/pdf</valueMatcher>
  </restrictTo>
</handler>

The above example is used as a post-parse handler. It detects URLs in parsed PDFs and stores those URLs in a field called "documentURLs".

Since: 3.0.0
Author: Pascal Essiembre

Constructor Summary

Constructors
Constructor and Description
`URLExtractorTagger()`

Method Summary

Modifier and Type	Method and Description
`boolean`	`equals(Object other)`
`TextMatcher`	`getFieldMatcher()` Gets field matcher for fields containing text.
`int`	`getMaxReadSize()` Gets the maximum number of characters to read from content for tagging at once.
`PropertySetter`	`getOnSet()` Gets the property setter to use when a value is set.
`String`	`getToField()`
`int`	`hashCode()`
`protected void`	`loadCharStreamTaggerFromXML(XML xml)` Loads configuration settings specific to the implementing class.
`protected void`	`saveCharStreamTaggerToXML(XML xml)` Saves configuration settings specific to the implementing class.
`void`	`setFieldMatcher(TextMatcher fieldMatcher)` Sets the field matcher for fields containing text.
`void`	`setMaxReadSize(int maxReadSize)` Sets the maximum number of characters to read from content for tagging at once.
`void`	`setOnSet(PropertySetter onSet)` Sets the property setter to use when a value is set.
`void`	`setToField(String toField)`
`protected void`	`tagTextDocument(HandlerDoc doc, Reader input, ParseState parseState)`
`String`	`toString()`

Methods inherited from class com.norconex.importer.handler.tagger.AbstractCharStreamTagger
getSourceCharset, loadHandlerFromXML, saveHandlerToXML, setSourceCharset, tagApplicableDocument

Methods inherited from class com.norconex.importer.handler.tagger.AbstractDocumentTagger
tagDocument

Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML

- Constructor Detail
  - URLExtractorTagger
```
public URLExtractorTagger()
```
- Method Detail
  - tagTextDocument
```
protected void tagTextDocument(HandlerDoc doc,
                               Reader input,
                               ParseState parseState)
                        throws ImporterHandlerException
```
    Specified by:
    
    tagTextDocument in class AbstractCharStreamTagger
    
    Throws:
    
    ImporterHandlerException
  - getToField
```
public String getToField()
```
  - setToField
```
public void setToField(String toField)
```
  - getFieldMatcher
```
public TextMatcher getFieldMatcher()
```
    Gets field matcher for fields containing text.
    
    Returns:
    
    field matcher
  - setFieldMatcher
```
public void setFieldMatcher(TextMatcher fieldMatcher)
```
    Sets the field matcher for fields containing text.
    
    Parameters:
    
    fieldMatcher - field matcher
  - getOnSet
```
public PropertySetter getOnSet()
```
    Gets the property setter to use when a value is set.
    
    Returns:
    
    property setter
    
    Since:
    
    3.0.0
  - setOnSet
```
public void setOnSet(PropertySetter onSet)
```
    Sets the property setter to use when a value is set.
    
    Parameters:
    
    onSet - property setter
    
    Since:
    
    3.0.0
  - getMaxReadSize
```
public int getMaxReadSize()
```
    Gets the maximum number of characters to read from content for tagging at once. Default is TextReader.DEFAULT_MAX_READ_SIZE.
    
    Returns:
    
    maximum read size
  - setMaxReadSize
```
public void setMaxReadSize(int maxReadSize)
```
    Sets the maximum number of characters to read from content for tagging at once.
    
    Parameters:
    
    maxReadSize - maximum read size
  - loadCharStreamTaggerFromXML
```
protected void loadCharStreamTaggerFromXML(XML xml)
```
    Description copied from class: AbstractCharStreamTagger
    
    Loads configuration settings specific to the implementing class.
    
    Specified by:
    
    loadCharStreamTaggerFromXML in class AbstractCharStreamTagger
    
    Parameters:
    
    xml - xml configuration
  - saveCharStreamTaggerToXML
```
protected void saveCharStreamTaggerToXML(XML xml)
```
    Description copied from class: AbstractCharStreamTagger
    
    Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
    
    Specified by:
    
    saveCharStreamTaggerToXML in class AbstractCharStreamTagger
    
    Parameters:
    
    xml - the XML
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class AbstractCharStreamTagger
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class AbstractCharStreamTagger
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class AbstractCharStreamTagger
