GenericLinkExtractor (Norconex HTTP Collector 2.9.1 API)

java.lang.Object
- com.norconex.collector.http.url.impl.GenericLinkExtractor

All Implemented Interfaces:

ILinkExtractor, IXMLConfigurable
```
public class GenericLinkExtractor
extends Object
implements ILinkExtractor, IXMLConfigurable
```
Generic link extractor for URLs found in HTML and possibly other text files.
Content-types
By default, this extractor will look for URLs only in documents matching one of these content types:
```
 text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
 
```
You can specify your own content types if you know they represent a file with HTML-like markup tags containing URLs. For documents that are just too different, consider implementing your own ILinkExtractor instead. Removing the default values and defining no content types causes the extractor to attempt URL extraction on all files (usually a bad idea).
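The acceptance rule described above can be sketched as follows. This is a stand-alone illustration, not the library's code: the class ContentTypeFilterSketch and its methods are hypothetical, and content types are modeled as plain strings rather than ContentType objects.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of content-type filtering; not Norconex code.
class ContentTypeFilterSketch {
    private final Set<String> contentTypes = new HashSet<>();

    ContentTypeFilterSketch(String... types) {
        contentTypes.addAll(Arrays.asList(types));
    }

    // With no content types configured, every document is accepted.
    boolean accepts(String contentType) {
        return contentTypes.isEmpty() || contentTypes.contains(contentType);
    }

    public static void main(String[] args) {
        ContentTypeFilterSketch defaults = new ContentTypeFilterSketch(
                "text/html", "application/xhtml+xml");
        System.out.println(defaults.accepts("text/html"));
        System.out.println(defaults.accepts("application/pdf"));
        // No content types at all: everything is attempted.
        System.out.println(new ContentTypeFilterSketch().accepts("application/pdf"));
    }
}
```

The empty-set case is what makes removing all content types risky: binary files would be scanned for URLs as well.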
Tags attributes
URLs are assumed to be contained within valid tags or tag attributes. The default tags and attributes used are (tag.attribute):
```
 a.href, frame.src, iframe.src, img.src, meta.http-equiv
 
```
You can specify your own set of tags and attributes to have different ones used for extracting URLs. For a more elaborate set, you can combine the above with your own list or use any of the following suggestions (tag.attribute):
```
 applet.archive,   applet.codebase,  area.href,         audio.src,
 base.href,        blockquote.cite,  body.background,   button.formaction,
 command.icon,     del.cite,         embed.src,         form.action,
 frame.longdesc,   head.profile,     html.manifest,     iframe.longdesc,
 img.longdesc,     img.usemap,       input.formaction,  input.src,
 input.usemap,     ins.cite,         link.href,         object.archive,
 object.classid,   object.codebase,  object.data,       object.usemap,
 q.cite,           script.src,       source.src,        video.poster,
 video.src
 
```
The meta.http-equiv is treated differently. Its URL is extracted only if the "http-equiv" value is "refresh" and a "content" attribute containing a URL exists. "object" and "applet" can have multiple URLs.
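As a rough illustration of the meta refresh rule, the sketch below pulls the URL out of a "content" value such as "5; url=page.html". The class MetaRefreshSketch, its method, and the regular expression are assumptions made for this example; the extractor's actual parsing may differ.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of meta refresh handling; not Norconex code.
class MetaRefreshSketch {
    // Optional delay, optional separator, then "url=" and the (optionally quoted) target.
    private static final Pattern REFRESH_URL = Pattern.compile(
            "^\\s*\\d*\\s*[;,]?\\s*url\\s*=\\s*['\"]?([^'\"\\s]+)['\"]?\\s*$",
            Pattern.CASE_INSENSITIVE);

    // Returns the URL only when http-equiv is "refresh" and content holds a URL.
    static Optional<String> extractUrl(String httpEquiv, String content) {
        if (httpEquiv == null || content == null
                || !"refresh".equalsIgnoreCase(httpEquiv.trim())) {
            return Optional.empty();
        }
        Matcher m = REFRESH_URL.matcher(content);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractUrl("refresh", "5; url=http://example.com/next.html"));
        System.out.println(extractUrl("content-type", "text/html; charset=utf-8"));
    }
}
```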

Since 2.2.0, it is possible to identify a tag only as the holder of a URL (without attributes). The tag body value will be used as the URL.

Referrer data

The following referrer information is stored as metadata in each document represented by the extracted URLs:
- Referrer reference: The reference (URL) of the page where the link to a document was found. Metadata value is HttpMetadata.COLLECTOR_REFERRER_REFERENCE.
- Referrer link tag: The tag and attribute names of the link that contained the document reference (URL) in referrer's content. Metadata value is HttpMetadata.COLLECTOR_REFERRER_LINK_TAG.
- Referrer link text: The text between the <a href=""></a> tags of the referrer document. Can be useful to help establish better document titles. Metadata value is HttpMetadata.COLLECTOR_REFERRER_LINK_TEXT.
- Referrer link title: The title attribute of the link that contained the document reference (URL) in referrer's content. Can also be useful to help establish better document titles. Metadata value is HttpMetadata.COLLECTOR_REFERRER_LINK_TITLE.
Since 2.6.0, the referrer data is always stored (was optional before).

Character encoding

Since 2.4.0, this extractor will by default attempt to detect the encoding of a page when extracting links and referrer information. If no charset can be detected, it falls back to UTF-8. It is also possible to dictate which encoding to use with setCharset(String).
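The precedence described above (explicit charset, then detected charset, then UTF-8) might look like the following sketch. CharsetFallbackSketch and resolve are hypothetical names; the detection step itself is not shown and is represented only by its result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of charset resolution; not Norconex code.
class CharsetFallbackSketch {
    // configured: charset set explicitly, or null.
    // detected: outcome of charset detection, or null when detection failed.
    static Charset resolve(String configured, String detected) {
        if (configured != null && !configured.trim().isEmpty()) {
            return Charset.forName(configured.trim());
        }
        if (detected != null && Charset.isSupported(detected)) {
            return Charset.forName(detected);
        }
        return StandardCharsets.UTF_8; // documented fallback
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, null));
        System.out.println(resolve("ISO-8859-1", "UTF-16"));
    }
}
```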

"nofollow"

By default, a regular HTML link having the "rel" attribute set to "nofollow" won't be extracted (e.g. <a href="x.html" rel="nofollow" ...>). To force its extraction (and ensure it is followed), call setIgnoreNofollow(boolean) with true.

Since 2.6.0 it is possible to treat all the links in certain pages as "nofollow" links. Link extraction is essentially skipped for URLs matching the patterns set in setNofollowPatterns(List).
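Both "nofollow" behaviors can be pictured with the sketch below. It is an illustration under assumptions, not the extractor's implementation: NofollowSketch and its methods are hypothetical, patterns are assumed to match the whole page URL, and the "rel" value is assumed to be a space-separated token list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of "nofollow" handling; not Norconex code.
class NofollowSketch {
    // Link-level rule: skip a link marked rel="nofollow" unless told to ignore it.
    static boolean shouldExtract(String rel, boolean ignoreNofollow) {
        if (ignoreNofollow || rel == null) {
            return true;
        }
        for (String token : rel.trim().split("\\s+")) {
            if ("nofollow".equalsIgnoreCase(token)) {
                return false;
            }
        }
        return true;
    }

    // Page-level rule: no links are extracted from pages matching a pattern.
    static boolean isNofollowPage(String pageUrl, List<String> patterns) {
        for (String regex : patterns) {
            if (Pattern.matches(regex, pageUrl)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList(".*/archive/.*");
        System.out.println(isNofollowPage("https://example.com/archive/2019.html", patterns));
        System.out.println(shouldExtract("nofollow", false));
    }
}
```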

URL Fragments

Since 2.3.0, this extractor preserves hashtag characters (#) found in URLs and every character after them. It relies on the implementation of IURLNormalizer to strip them if need be. GenericURLNormalizer is now always invoked by default, and the default set of rules defined for it will remove fragments.

The URL specification says hashtags are used to represent fragments only, that is, to quickly jump to a specific section of the page the URL represents. Under normal circumstances, keeping URL fragments leads to duplicate documents being fetched (same URL but different fragment), so they should be stripped. Unfortunately, some sites do not follow the URL standard and use hashtags as a regular part of a URL (i.e. different hashtags point to different web pages). It may be essential when crawling these sites to keep the URL fragments. This can be done by making sure the URL normalizer does not strip them.
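To see why fragments usually cause duplicates, consider the small sketch below. It is plain string handling written for this example (FragmentSketch is a hypothetical class), not the logic of GenericURLNormalizer.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of fragment removal; not Norconex code.
class FragmentSketch {
    // The fragment starts at the first '#' and runs to the end of the URL.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // URLs that differ only by fragment collapse to a single entry.
    static Set<String> distinctWithoutFragments(List<String> urls) {
        Set<String> result = new LinkedHashSet<>();
        for (String url : urls) {
            result.add(stripFragment(url));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(distinctWithoutFragments(Arrays.asList(
                "http://example.com/page.html#intro",
                "http://example.com/page.html#details")));
    }
}
```

On a site that uses fragments as page identifiers (for example "http://example.com/#/products/42"), the same stripping would collapse distinct pages into one, which is the case where fragments should be kept.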

URL Schemes

Since 2.4.0, only valid schemes are extracted for absolute URLs. By default, those are http, https, and ftp. You can specify your own list of supported protocols with setSchemes(String[]).
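A scheme check along these lines could be sketched as follows. SchemeFilterSketch is a hypothetical class for illustration; it only considers absolute URLs and compares schemes case-insensitively, which is an assumption rather than documented behavior.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of scheme filtering; not Norconex code.
class SchemeFilterSketch {
    private final Set<String> schemes = new HashSet<>();

    SchemeFilterSketch(String... schemes) {
        for (String scheme : schemes) {
            this.schemes.add(scheme.toLowerCase(Locale.ROOT));
        }
    }

    // Intended for absolute URLs only; anything without a scheme is rejected here.
    boolean isSupported(String absoluteUrl) {
        int colon = absoluteUrl.indexOf(':');
        if (colon <= 0) {
            return false;
        }
        String scheme = absoluteUrl.substring(0, colon).toLowerCase(Locale.ROOT);
        return schemes.contains(scheme);
    }

    public static void main(String[] args) {
        SchemeFilterSketch filter = new SchemeFilterSketch("http", "https", "ftp");
        System.out.println(filter.isSupported("https://example.com/"));
        System.out.println(filter.isSupported("mailto:someone@example.com"));
    }
}
```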

HTML/XML Comments

Since 2.6.0, URLs found in <!-- comments --> are no longer extracted by default. To enable URL extraction from comments, use setCommentsEnabled(boolean).
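One way to picture the effect of this setting is to drop comments from the markup before looking for links, as in the sketch below. CommentStripSketch and prepare are hypothetical names, and the regular expression is a simplification; the extractor's real comment handling may work differently.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of comment handling; not Norconex code.
class CommentStripSketch {
    // DOTALL lets a comment span several lines; the reluctant quantifier stops at the first "-->".
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // When comments are disabled, they are removed before link extraction.
    static String prepare(String markup, boolean commentsEnabled) {
        return commentsEnabled ? markup : COMMENT.matcher(markup).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">A</a><!-- <a href=\"old.html\">Old</a> -->";
        System.out.println(prepare(html, false));
        System.out.println(prepare(html, true));
    }
}
```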

Extract links in certain parts only

Since 2.8.0, you can identify portions of a document where links should be extracted or ignored with setExtractBetweens(RegexPair...) and setNoExtractBetweens(RegexPair...). Eligible content for link extraction is identified first, and exclusions are then applied to that subset.
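The two-step order (keep eligible portions first, then exclude from what was kept) can be sketched as below. BetweenSketch and its methods are hypothetical, and whether the real extractor keeps or drops the delimiters themselves is not stated here; this sketch drops them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of extract/no-extract portions; not Norconex code.
class BetweenSketch {
    private static Pattern compile(String start, String end, boolean caseSensitive) {
        int flags = Pattern.DOTALL | (caseSensitive ? 0 : Pattern.CASE_INSENSITIVE);
        return Pattern.compile("(?:" + start + ")(?<body>.*?)(?:" + end + ")", flags);
    }

    // Step 1: keep only the text found between start and end matches.
    static String keepBetween(String text, String start, String end, boolean caseSensitive) {
        Matcher m = compile(start, end, caseSensitive).matcher(text);
        StringBuilder kept = new StringBuilder();
        while (m.find()) {
            kept.append(m.group("body"));
        }
        return kept.toString();
    }

    // Step 2: remove the text found between start and end matches.
    static String removeBetween(String text, String start, String end, boolean caseSensitive) {
        return compile(start, end, caseSensitive).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<nav><a href=\"n.html\">n</a></nav>"
                + "<main><a href=\"a.html\">a</a>"
                + "<footer><a href=\"f.html\">f</a></footer></main>";
        String eligible = keepBetween(page, "<main>", "</main>", false);
        System.out.println(removeBetween(eligible, "<footer>", "</footer>", false));
    }
}
```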

Since 2.9.0, you can further limit link extraction to specific areas using selector syntax, with setExtractSelectors(String...) and setNoExtractSelectors(String...).

XML configuration usage:
```
  <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor"
          maxURLLength="(maximum URL length. Default is 2048)"
          ignoreNofollow="[false|true]"
          commentsEnabled="[false|true]"
          charset="(supported character encoding)" >
      <contentTypes>
          (CSV list of content types on which to perform link extraction.
           leave blank or remove tag to use defaults.)
      </contentTypes>
      <schemes>
          (CSV list of URI schemes for which to perform link extraction.
           leave blank or remove tag to use defaults.)
      </schemes>
      <tags>
          <tag name="(tag name)" attribute="(tag attribute)" />
      </tags>
      <extractBetween caseSensitive="[false|true]">
          <start>(regex)</start>
          <end>(regex)</end>
      </extractBetween>
      <noExtractBetween caseSensitive="[false|true]">
          <start>(regex)</start>
          <end>(regex)</end>
      </noExtractBetween>
      <extractSelector>(selector)</extractSelector>
      <noExtractSelector>(selector)</noExtractSelector>
  </extractor>
 
```
Usage example:

The following adds URLs to JavaScript files to the list of URLs to be extracted.
```
  <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
      <tags>
          <tag name="a" attribute="href" />
          <tag name="frame" attribute="src" />
          <tag name="iframe" attribute="src" />
          <tag name="img" attribute="src" />
          <tag name="meta" attribute="http-equiv" />
          <tag name="script" attribute="src" />
      </tags>
  </extractor>
 
```
Since:

2.3.0

Author:

Pascal Essiembre

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`GenericLinkExtractor.RegexPair`

Field Summary

Fields
Modifier and Type	Field and Description
`static int`	`DEFAULT_MAX_URL_LENGTH` Default maximum length a URL can have.
`static int`	`MAX_BUFFER_SIZE`
`static int`	`OVERLAP_SIZE`

Constructor Summary

Constructors
Constructor and Description
`GenericLinkExtractor()`

Method Summary

Modifier and Type	Method and Description
`boolean`	`accepts(String url, ContentType contentType)` Whether this link extraction should be executed for the given URL and/or content type.
`void`	`addExtractBetween(String start, String end, boolean caseSensitive)` Adds patterns delimiting a portion of a document to be considered for link extraction.
`void`	`addExtractSelectors(String... selectors)` Adds selectors matching the portions of a document to be considered for link extraction.
`void`	`addLinkTag(String tagName, String attribute)`
`void`	`addNoExtractBetween(String start, String end, boolean caseSensitive)` Adds patterns delimiting a portion of a document to be excluded from link extraction.
`void`	`addNoExtractSelectors(String... selectors)` Adds selectors matching the portions of a document to be excluded from link extraction.
`void`	`addNofollowPatterns(String regex)` Adds a pattern for references for which link extraction is disabled.
`void`	`clearLinkTags()`
`boolean`	`equals(Object other)`
`Set<Link>`	`extractLinks(InputStream input, String reference, ContentType contentType)` Extracts links from a document.
`String`	`getCharset()` Gets the character set of pages on which link extraction is performed.
`ContentType[]`	`getContentTypes()`
`GenericLinkExtractor.RegexPair[]`	`getExtractBetweens()` Gets the patterns delimiting the portions of a document to be considered for link extraction.
`String[]`	`getExtractSelectors()` Gets the selectors matching the portions of a document to be considered for link extraction.
`int`	`getMaxURLLength()` Gets the maximum supported URL length.
`GenericLinkExtractor.RegexPair[]`	`getNoExtractBetweens()` Gets the patterns delimiting the portions of a document to be excluded from link extraction.
`String[]`	`getNoExtractSelectors()` Gets the selectors matching the portions of a document to be excluded from link extraction.
`List<String>`	`getNofollowPatterns()` Gets the patterns of references for which link extraction is disabled.
`String[]`	`getSchemes()` Gets the schemes to be extracted.
`int`	`hashCode()`
`boolean`	`isCommentsEnabled()` Gets whether links should be extracted from HTML/XML comments.
`boolean`	`isIgnoreNofollow()`
`boolean`	`isKeepReferrerData()` Deprecated. Since 2.6.0, referrer data is always kept
`void`	`loadFromXML(Reader in)`
`void`	`removeLinkTag(String tagName, String attribute)`
`void`	`saveToXML(Writer out)`
`void`	`setCharset(String charset)` Sets the character set of pages on which link extraction is performed.
`void`	`setCommentsEnabled(boolean commentsEnabled)` Sets whether links should be extracted from HTML/XML comments.
`void`	`setContentTypes(ContentType... contentTypes)`
`void`	`setExtractBetweens(GenericLinkExtractor.RegexPair... betweens)` Sets the patterns delimiting the portions of a document to be considered for link extraction.
`void`	`setExtractSelectors(String... selectors)` Sets the selectors matching the portions of a document to be considered for link extraction.
`void`	`setIgnoreNofollow(boolean ignoreNofollow)`
`void`	`setKeepReferrerData(boolean keepReferrerData)` Deprecated. Since 2.6.0, referrer data is always kept
`void`	`setMaxURLLength(int maxURLLength)` Sets the maximum supported URL length.
`void`	`setNoExtractBetweens(GenericLinkExtractor.RegexPair... betweens)` Sets the patterns delimiting the portions of a document to be excluded from link extraction.
`void`	`setNoExtractSelectors(String... selectors)` Sets the selectors matching the portions of a document to be excluded from link extraction.
`void`	`setNofollowPatterns(List<String> patterns)` Sets the patterns of references for which link extraction is disabled.
`void`	`setSchemes(String... schemes)` Sets the schemes to be extracted.
`String`	`toString()`

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Field Detail
  - MAX_BUFFER_SIZE
```
public static final int MAX_BUFFER_SIZE
```
    See Also:
    
    Constant Field Values
  - OVERLAP_SIZE
```
public static final int OVERLAP_SIZE
```
    See Also:
    
    Constant Field Values
  - DEFAULT_MAX_URL_LENGTH
```
public static final int DEFAULT_MAX_URL_LENGTH
```
    Default maximum length a URL can have.
    
    See Also:
    
    Constant Field Values
- Constructor Detail
  - GenericLinkExtractor
```
public GenericLinkExtractor()
```
- Method Detail
  - extractLinks
```
public Set<Link> extractLinks(InputStream input,
                              String reference,
                              ContentType contentType)
                       throws IOException
```
    Description copied from interface: ILinkExtractor
    
    Extracts links from a document.
    
    Specified by:
    
    extractLinks in interface ILinkExtractor
    
    Parameters:
    
    input - the document input stream
    
    reference - document reference (URL)
    
    contentType - the document content type
    
    Returns:
    
    a set of links
    
    Throws:
    
    IOException - problem reading the document
  - accepts
```
public boolean accepts(String url,
                       ContentType contentType)
```
    Description copied from interface: ILinkExtractor
    
    Whether this link extraction should be executed for the given URL and/or content type.
    
    Specified by:
    
    accepts in interface ILinkExtractor
    
    Parameters:
    
    url - the url
    
    contentType - the content type
    
    Returns:
    
    true if the given URL is accepted
  - getMaxURLLength
```
public int getMaxURLLength()
```
    Gets the maximum supported URL length.
    
    Returns:
    
    maximum URL length
  - setMaxURLLength
```
public void setMaxURLLength(int maxURLLength)
```
    Sets the maximum supported URL length.
    
    Parameters:
    
    maxURLLength - maximum URL length
  - getContentTypes
```
public ContentType[] getContentTypes()
```
  - setContentTypes
```
public void setContentTypes(ContentType... contentTypes)
```
  - getExtractBetweens
```
public GenericLinkExtractor.RegexPair[] getExtractBetweens()
```
    Gets the patterns delimiting the portions of a document to be considered for link extraction.
    
    Returns:
    
    extract between patterns
    
    Since:
    
    2.8.0
  - setExtractBetweens
```
public void setExtractBetweens(GenericLinkExtractor.RegexPair... betweens)
```
    Sets the patterns delimiting the portions of a document to be considered for link extraction.
    
    Parameters:
    
    betweens - extract between patterns
    
    Since:
    
    2.8.0
  - addExtractBetween
```
public void addExtractBetween(String start,
                              String end,
                              boolean caseSensitive)
```
    Adds patterns delimiting a portion of a document to be considered for link extraction.
    
    Parameters:
    
    start - pattern matching start of text portion
    
    end - pattern matching end of text portion
    
    caseSensitive - whether the patterns are case sensitive or not
    
    Since:
    
    2.8.0
  - getNoExtractBetweens
```
public GenericLinkExtractor.RegexPair[] getNoExtractBetweens()
```
    Gets the patterns delimiting the portions of a document to be excluded from link extraction.
    
    Returns:
    
    no-extract between patterns
    
    Since:
    
    2.8.0
  - setNoExtractBetweens
```
public void setNoExtractBetweens(GenericLinkExtractor.RegexPair... betweens)
```
    Sets the patterns delimiting the portions of a document to be excluded from link extraction.
    
    Parameters:
    
    betweens - no-extract between patterns
    
    Since:
    
    2.8.0
  - addNoExtractBetween
```
public void addNoExtractBetween(String start,
                                String end,
                                boolean caseSensitive)
```
    Adds patterns delimiting a portion of a document to be excluded from link extraction.
    
    Parameters:
    
    start - pattern matching start of text portion
    
    end - pattern matching end of text portion
    
    caseSensitive - whether the patterns are case sensitive or not
    
    Since:
    
    2.8.0
  - getExtractSelectors
```
public String[] getExtractSelectors()
```
    Gets the selectors matching the portions of a document to be considered for link extraction.
    
    Returns:
    
    selectors
    
    Since:
    
    2.9.0
  - setExtractSelectors
```
public void setExtractSelectors(String... selectors)
```
    Sets the selectors matching the portions of a document to be considered for link extraction.
    
    Parameters:
    
    selectors - selectors
    
    Since:
    
    2.9.0
  - addExtractSelectors
```
public void addExtractSelectors(String... selectors)
```
    Adds selectors matching the portions of a document to be considered for link extraction.
    
    Parameters:
    
    selectors - selectors
    
    Since:
    
    2.9.0
  - getNoExtractSelectors
```
public String[] getNoExtractSelectors()
```
    Gets the selectors matching the portions of a document to be excluded from link extraction.
    
    Returns:
    
    selectors
    
    Since:
    
    2.9.0
  - setNoExtractSelectors
```
public void setNoExtractSelectors(String... selectors)
```
    Sets the selectors matching the portions of a document to be excluded from link extraction.
    
    Parameters:
    
    selectors - selectors
    
    Since:
    
    2.9.0
  - addNoExtractSelectors
```
public void addNoExtractSelectors(String... selectors)
```
    Adds selectors matching the portions of a document to be excluded from link extraction.
    
    Parameters:
    
    selectors - selectors
    
    Since:
    
    2.9.0
  - getNofollowPatterns
```
public List<String> getNofollowPatterns()
```
    Gets the patterns of references for which link extraction is disabled.
    
    Returns:
    
    nofollow regex patterns
    
    Since:
    
    2.9.0
  - setNofollowPatterns
```
public void setNofollowPatterns(List<String> patterns)
```
    Sets the patterns of references for which link extraction is disabled.
    
    Parameters:
    
    patterns - the list of regex URL patterns
    
    Since:
    
    2.9.0
  - addNofollowPatterns
```
public void addNofollowPatterns(String regex)
```
    Adds a pattern for references for which link extraction is disabled.
    
    Parameters:
    
    regex - the regex URL pattern
    
    Since:
    
    2.9.0
  - isCommentsEnabled
```
public boolean isCommentsEnabled()
```
    Gets whether links should be extracted from HTML/XML comments.
    
    Returns:
    
    true if links should be extracted from comments.
    
    Since:
    
    2.6.0
  - setCommentsEnabled
```
public void setCommentsEnabled(boolean commentsEnabled)
```
    Sets whether links should be extracted from HTML/XML comments.
    
    Parameters:
    
    commentsEnabled - true if links should be extracted from comments.
    
    Since:
    
    2.6.0
  - getSchemes
```
public String[] getSchemes()
```
    Gets the schemes to be extracted.
    
    Returns:
    
    schemes to be extracted
    
    Since:
    
    2.4.0
  - setSchemes
```
public void setSchemes(String... schemes)
```
    Sets the schemes to be extracted.
    
    Parameters:
    
    schemes - schemes to be extracted
    
    Since:
    
    2.4.0
  - isIgnoreNofollow
```
public boolean isIgnoreNofollow()
```
  - setIgnoreNofollow
```
public void setIgnoreNofollow(boolean ignoreNofollow)
```
  - isKeepReferrerData
```
@Deprecated
public boolean isKeepReferrerData()
```
    Deprecated. Since 2.6.0, referrer data is always kept
    
    Gets whether to keep referrer data. Since 2.6.0, always returns true.
    
    Returns:
    
    true
  - setKeepReferrerData
```
@Deprecated
public void setKeepReferrerData(boolean keepReferrerData)
```
    Deprecated. Since 2.6.0, referrer data is always kept
    
    Sets whether to keep the referrer data. Since 2.6.0, this method has no effect.
    
    Parameters:
    
    keepReferrerData - referrer data
  - getCharset
```
public String getCharset()
```
    Gets the character set of pages on which link extraction is performed. Default is null (charset detection will be attempted).
    
    Returns:
    
    character set to use, or null
    
    Since:
    
    2.4.0
  - setCharset
```
public void setCharset(String charset)
```
    Sets the character set of pages on which link extraction is performed. Not specifying any (null) will attempt charset detection.
    
    Parameters:
    
    charset - character set to use, or null
    
    Since:
    
    2.4.0
  - addLinkTag
```
public void addLinkTag(String tagName,
                       String attribute)
```
  - removeLinkTag
```
public void removeLinkTag(String tagName,
                          String attribute)
```
  - clearLinkTags
```
public void clearLinkTags()
```
  - loadFromXML
```
public void loadFromXML(Reader in)
```
    Specified by:
    
    loadFromXML in interface IXMLConfigurable
  - saveToXML
```
public void saveToXML(Writer out)
               throws IOException
```
    Specified by:
    
    saveToXML in interface IXMLConfigurable
    
    Throws:
    
    IOException
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class Object
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class Object
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class Object
