TikaLinkExtractor (Norconex HTTP Collector 2.9.1 API)

java.lang.Object
- com.norconex.collector.http.url.impl.TikaLinkExtractor

All Implemented Interfaces:

ILinkExtractor, IXMLConfigurable
```
public class TikaLinkExtractor
extends Object
implements ILinkExtractor, IXMLConfigurable
```
Implementation of ILinkExtractor using Apache Tika to perform URL extractions from HTML documents. This is an alternative to the GenericLinkExtractor.

The configuration of content-types, storing the referrer data, and ignoring "nofollow" are the same as in GenericLinkExtractor.

URL Fragments

Since 2.3.0, this extractor preserves the hash character (#) found in URLs and every character after it. It relies on the IURLNormalizer implementation to strip the fragment if need be. Also since 2.3.0, GenericURLNormalizer is always invoked by default, and its default set of rules removes fragments.
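As a rough illustration of what removing a fragment means here, the sketch below drops the first # and everything after it using plain string operations. The class and method names are hypothetical, and this is not the GenericURLNormalizer implementation, only a stand-in for the effect its default rules are described as having.

```java
/** Hypothetical helper, not part of the Norconex API. */
final class FragmentSketch {

    /** Returns the URL without its fragment, or the URL unchanged if it has none. */
    static String stripFragment(String url) {
        // The first '#' starts the fragment; everything before it is kept.
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        // What the extractor would hand over, fragment preserved (example URL).
        String extracted = "http://example.com/page.html#section";
        System.out.println(stripFragment(extracted)); // prints "http://example.com/page.html"
    }
}
```

A crawler that needs to treat fragments as distinct pages would skip this step, which is presumably why the extractor itself leaves the fragment in place.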

XML configuration usage
```
  <extractor class="com.norconex.collector.http.url.impl.TikaLinkExtractor"
          ignoreNofollow="(false|true)" >
      <contentTypes>
          (CSV list of content types on which to perform link extraction.
           Leave blank or remove tag to use defaults.)
      </contentTypes>
  </extractor>
 
```
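To make the template above concrete, the following sketch reads a filled-in configuration with the JDK's DOM parser and pulls out the ignoreNofollow attribute and the comma-separated content types. The sample values (text/html, application/xhtml+xml) and the class and method names are assumptions chosen for illustration; this is not how loadFromXML is implemented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical reader for the documented XML layout, not the Norconex parser. */
final class ExtractorConfigSketch {

    // Example configuration; the content type values are illustrative only.
    static final String SAMPLE =
        "<extractor class=\"com.norconex.collector.http.url.impl.TikaLinkExtractor\""
        + " ignoreNofollow=\"true\">"
        + "<contentTypes>text/html, application/xhtml+xml</contentTypes>"
        + "</extractor>";

    /** Parses the XML string and returns its root element. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid extractor XML", e);
        }
    }

    /** Reads the ignoreNofollow attribute; a missing attribute yields false. */
    static boolean ignoreNofollow(Element extractor) {
        return Boolean.parseBoolean(extractor.getAttribute("ignoreNofollow"));
    }

    /** Splits the CSV content types; an absent or blank tag yields an empty list. */
    static List<String> contentTypes(Element extractor) {
        List<String> types = new ArrayList<>();
        NodeList nodes = extractor.getElementsByTagName("contentTypes");
        if (nodes.getLength() == 0) {
            return types;
        }
        for (String part : nodes.item(0).getTextContent().split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                types.add(trimmed);
            }
        }
        return types;
    }

    public static void main(String[] args) {
        Element extractor = parse(SAMPLE);
        System.out.println(ignoreNofollow(extractor));
        System.out.println(contentTypes(extractor));
    }
}
```

An empty list in this sketch corresponds to the "use defaults" case in the template; which defaults apply is decided by the extractor, not by this code.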
Author:

Pascal Essiembre

See Also:

GenericLinkExtractor

Constructor Summary

Constructors
Constructor and Description

TikaLinkExtractor()


Method Summary

Modifier and Type	Method and Description
`boolean`	`accepts(String url, ContentType contentType)` Whether this link extraction should be executed for the given URL and/or content type.
`boolean`	`equals(Object other)`
`Set<Link>`	`extractLinks(InputStream is, String url, ContentType contentType)` Extracts links from a document.
`ContentType[]`	`getContentTypes()`
`int`	`hashCode()`
`boolean`	`isIgnoreNofollow()`
`boolean`	`isKeepReferrerData()` Deprecated. Since 2.6.0, referrer data is always kept
`void`	`loadFromXML(Reader in)`
`void`	`saveToXML(Writer out)`
`void`	`setContentTypes(ContentType... contentTypes)`
`void`	`setIgnoreNofollow(boolean ignoreNofollow)`
`void`	`setKeepReferrerData(boolean keepReferrerData)` Deprecated. Since 2.6.0, referrer data is always kept
`String`	`toString()`

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - TikaLinkExtractor
```
public TikaLinkExtractor()
```
- Method Detail
  - extractLinks
```
public Set<Link> extractLinks(InputStream is,
                              String url,
                              ContentType contentType)
                       throws IOException
```
    Description copied from interface: ILinkExtractor
    
    Extracts links from a document.
    
    Specified by:
    
    extractLinks in interface ILinkExtractor
    
    Parameters:
    
    is - the document input stream
    
    url - document reference (URL)
    
    contentType - the document content type
    
    Returns:
    
    a set of links
    
    Throws:
    
    IOException - problem reading the document
  - getContentTypes
```
public ContentType[] getContentTypes()
```
  - setContentTypes
```
public void setContentTypes(ContentType... contentTypes)
```
  - isIgnoreNofollow
```
public boolean isIgnoreNofollow()
```
  - setIgnoreNofollow
```
public void setIgnoreNofollow(boolean ignoreNofollow)
```
  - isKeepReferrerData
```
@Deprecated
public boolean isKeepReferrerData()
```
    Deprecated. Since 2.6.0, referrer data is always kept
    
    Gets whether to keep referrer data. Since 2.6.0, this method always returns true.
    
    Returns:
    
    true
  - setKeepReferrerData
```
@Deprecated
public void setKeepReferrerData(boolean keepReferrerData)
```
    Deprecated. Since 2.6.0, referrer data is always kept
    
    Sets whether to keep the referrer data. Since 2.6.0, this method has no effect.
    
    Parameters:
    
    keepReferrerData - referrer data
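The documented behavior of the two deprecated accessors can be pictured with the minimal sketch below: the setter ignores its argument and the getter always reports true. ReferrerFlagSketch is a hypothetical stand-in written for this illustration, not the TikaLinkExtractor source.

```java
/** Hypothetical stand-in mirroring the documented contract, not the Norconex class. */
final class ReferrerFlagSketch {

    /** Always true, matching "referrer data is always kept". */
    @Deprecated
    boolean isKeepReferrerData() {
        return true;
    }

    /** Has no effect; kept only so existing callers still compile. */
    @Deprecated
    void setKeepReferrerData(boolean keepReferrerData) {
        // Argument intentionally ignored.
    }

    public static void main(String[] args) {
        ReferrerFlagSketch sketch = new ReferrerFlagSketch();
        sketch.setKeepReferrerData(false);
        System.out.println(sketch.isKeepReferrerData()); // prints "true"
    }
}
```

In practice this means configuration or code that still sets the flag to false can simply have that call removed.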
  - accepts
```
public boolean accepts(String url,
                       ContentType contentType)
```
    Description copied from interface: ILinkExtractor
    
    Whether this link extraction should be executed for the given URL and/or content type.
    
    Specified by:
    
    accepts in interface ILinkExtractor
    
    Parameters:
    
    url - the url
    
    contentType - the content type
    
    Returns:
    
    true if the given URL is accepted
  - loadFromXML
```
public void loadFromXML(Reader in)
```
    Specified by:
    
    loadFromXML in interface IXMLConfigurable
  - saveToXML
```
public void saveToXML(Writer out)
               throws IOException
```
    Specified by:
    
    saveToXML in interface IXMLConfigurable
    
    Throws:
    
    IOException
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class Object
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class Object
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class Object
