java.lang.Object
- com.norconex.collector.http.link.AbstractLinkExtractor
- - com.norconex.collector.http.link.AbstractTextLinkExtractor
- - - com.norconex.collector.http.link.impl.HtmlLinkExtractor

All Implemented Interfaces:

ILinkExtractor, IXMLConfigurable

Direct Known Subclasses:

GenericLinkExtractor
```
public class HtmlLinkExtractor
extends AbstractTextLinkExtractor
```
HTML link extractor for URLs found in HTML and possibly other text files.

This link extractor uses regular expressions to extract links. It does so on a chunk of text at a time, so that large files are not fully loaded into memory. If you prefer a more flexible implementation that loads the DOM model in memory to perform link extraction, consider using DOMLinkExtractor.
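The chunk-at-a-time approach described above can be illustrated with a minimal, JDK-only sketch. It is not the class's actual implementation: the `ChunkedLinkScan` name, the simplified `href` pattern, and the `chunkSize` and `overlap` parameters are assumptions made for illustration (the real class exposes MAX_BUFFER_SIZE and OVERLAP_SIZE, whose values are not stated here).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChunkedLinkScan {

    // Hypothetical pattern: only double-quoted href values are recognized.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /**
     * Scans the reader one chunk at a time. After each chunk, only the last
     * "overlap" characters are kept, so a link straddling two chunks can
     * still be matched once the next chunk arrives.
     */
    static Set<String> scan(Reader reader, int chunkSize, int overlap) {
        // A set absorbs links matched twice because they sit in the overlap.
        Set<String> urls = new LinkedHashSet<>();
        StringBuilder window = new StringBuilder();
        char[] chunk = new char[chunkSize];
        try {
            int read;
            while ((read = reader.read(chunk)) != -1) {
                window.append(chunk, 0, read);
                Matcher m = HREF.matcher(window);
                while (m.find()) {
                    urls.add(m.group(1));
                }
                // Drop everything but the tail to keep memory bounded.
                if (window.length() > overlap) {
                    window.delete(0, window.length() - overlap);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }
}
```

In this sketch, a link is only guaranteed to be found if its whole attribute fits within the overlap, which is the trade-off for never holding the full document in memory.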

Applicable documents

By default, this extractor is applied only to documents matching one of its default content types.

You can specify your own content types or other restrictions with AbstractLinkExtractor.setRestrictions(List). Make sure they represent a file with HTML-like markup tags containing URLs. For documents that are just too different, consider implementing your own ILinkExtractor instead. Removing the default values without defining any content types causes the extractor to attempt URL extraction from all files (usually a bad idea).

Tags attributes

URLs are assumed to be contained within valid tags or tag attributes. The default tags and attributes used are (tag.attribute):
```
 a.href, frame.src, iframe.src, img.src, meta.http-equiv
```
You can specify your own set of tags and attributes to use for extracting URLs. For a more elaborate set, you can combine the defaults above with your own list or use any of the following suggestions (tag.attribute):
```
 applet.archive,   applet.codebase,  area.href,         audio.src,
 base.href,        blockquote.cite,  body.background,   button.formaction,
 command.icon,     del.cite,         embed.src,         form.action,
 frame.longdesc,   head.profile,     html.manifest,     iframe.longdesc,
 img.longdesc,     img.usemap,       input.formaction,  input.src,
 input.usemap,     ins.cite,         link.href,         object.archive,
 object.classid,   object.codebase,  object.data,       object.usemap,
 q.cite,           script.src,       source.src,        video.poster,
 video.src
```
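To make the tag.attribute notation concrete, here is a minimal, JDK-only sketch that turns such pairs into regular expressions. The `TagAttributeScan` class and its pattern are hypothetical and much simpler than what the extractor does internally; the sketch assumes every entry has the "tag.attribute" form and that attribute values are quoted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagAttributeScan {

    /** Returns attribute values for each "tag.attribute" pair, pair by pair. */
    static List<String> extract(String markup, String... tagAttributes) {
        List<String> urls = new ArrayList<>();
        for (String pair : tagAttributes) {
            // Assumption: the pair always contains a dot (e.g. "img.src").
            String[] parts = pair.split("\\.", 2);
            String tag = Pattern.quote(parts[0]);
            String attr = Pattern.quote(parts[1]);
            // <tag ... attr="value"> or <tag ... attr='value'>
            Pattern p = Pattern.compile(
                    "<" + tag + "\\b[^>]*?\\s" + attr
                            + "\\s*=\\s*([\"'])(.*?)\\1",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
            Matcher m = p.matcher(markup);
            while (m.find()) {
                urls.add(m.group(2));
            }
        }
        return urls;
    }
}
```

With the class itself, additional pairs would be registered with addLinkTag(String, String) or the `<tags>` XML element rather than with patterns like these.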
The meta.http-equiv entry is treated differently: a URL is extracted only if the "http-equiv" value is "refresh" and a "content" attribute containing a URL exists. The "object" and "applet" tags can have multiple URLs.
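The special handling of meta refresh tags can be sketched as follows. This is an illustration only, using the JDK regex classes; `MetaRefreshScan` and its patterns are assumptions, and the sketch ignores refresh URLs that are themselves wrapped in quotes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRefreshScan {

    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern IS_REFRESH = Pattern.compile(
            "\\bhttp-equiv\\s*=\\s*[\"']?refresh\\b",
            Pattern.CASE_INSENSITIVE);
    // Captures the URL from e.g. content="5; url=next.html"
    private static final Pattern CONTENT_URL = Pattern.compile(
            "\\bcontent\\s*=\\s*[\"'][^\"']*?\\burl\\s*=\\s*([^\"'\\s>]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the refresh target of the first qualifying meta tag, if any. */
    static Optional<String> refreshUrl(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            Matcher url = CONTENT_URL.matcher(tag);
            // Both conditions must hold: http-equiv=refresh and a URL in content.
            if (IS_REFRESH.matcher(tag).find() && url.find()) {
                return Optional.of(url.group(1));
            }
        }
        return Optional.empty();
    }
}
```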

Since 2.2.0, it is possible to identify a tag only as the holder of a URL (without attributes). The tag body value will be used as the URL.
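As a rough illustration of using a tag body as the URL, the following JDK-only sketch collects the trimmed body of every occurrence of a given tag. The `TagBodyScan` class is hypothetical, and the `loc` tag used in the usage note is only an example of a tag whose body holds a URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagBodyScan {

    /** Returns the non-empty, trimmed bodies of all <tagName>...</tagName>. */
    static List<String> bodyUrls(String markup, String tagName) {
        String tag = Pattern.quote(tagName);
        Pattern p = Pattern.compile(
                "<" + tag + "\\b[^>]*>(.*?)</" + tag + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        List<String> urls = new ArrayList<>();
        Matcher m = p.matcher(markup);
        while (m.find()) {
            String body = m.group(1).trim();
            if (!body.isEmpty()) {
                urls.add(body);
            }
        }
        return urls;
    }
}
```

For example, `TagBodyScan.bodyUrls("<loc> http://example.com/a </loc>", "loc")` yields a single entry, `http://example.com/a`.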

Referrer data

Some "referrer" information is derived from each link and stored as metadata in the document it points to. This information may vary from link to link, but the metadata fields are normally prefixed with HttpDocMetadata.REFERRER_LINK_PREFIX.

Since 2.6.0, the referrer data is always stored (was optional before).

Character encoding

Since 2.4.0, this extractor attempts by default to detect the encoding of a page when extracting links and referrer information. If no charset can be detected, it falls back to UTF-8. It is also possible to dictate which encoding to use with setCharset(String).
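The order of precedence described above (explicit charset, then detected charset, then UTF-8) can be sketched with the JDK alone. `CharsetChoice` is a hypothetical helper, the detection step itself is left out, and skipping an unsupported configured name is an assumption of this sketch rather than documented behavior.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

class CharsetChoice {

    /**
     * Picks the configured charset if usable, else the detected one,
     * else UTF-8. Either argument may be null.
     */
    static Charset resolve(String configured, String detected) {
        for (String name : new String[] {configured, detected}) {
            if (name == null || name.trim().isEmpty()) {
                continue;
            }
            try {
                if (Charset.isSupported(name.trim())) {
                    return Charset.forName(name.trim());
                }
            } catch (IllegalCharsetNameException e) {
                // Malformed name: fall through to the next candidate.
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```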

"nofollow"

By default, a regular HTML link having the "rel" attribute set to "nofollow" is not extracted (e.g. <a href="x.html" rel="nofollow" ...>). To force its extraction (and ensure it is followed), call setIgnoreNofollow(boolean) with true.
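The nofollow rule can be expressed as a small decision function. The sketch below uses only the JDK; `NofollowCheck` and its `rel` pattern are assumptions for illustration and do not reflect how the extractor parses tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NofollowCheck {

    private static final Pattern REL = Pattern.compile(
            "\\brel\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** True if the tag's rel attribute lists the "nofollow" token. */
    static boolean isNofollow(String tag) {
        Matcher m = REL.matcher(tag);
        if (!m.find()) {
            return false;
        }
        // rel may hold several space-separated tokens, e.g. "noopener nofollow".
        for (String token : m.group(2).trim().split("\\s+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return true;
            }
        }
        return false;
    }

    /** Mirrors the documented rule: skip nofollow links unless told to ignore it. */
    static boolean shouldExtract(String tag, boolean ignoreNofollow) {
        return ignoreNofollow || !isNofollow(tag);
    }
}
```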

URL Fragments

Since 2.3.0, this extractor preserves hashtag characters (#) found in URLs and every character after them. It relies on the IURLNormalizer implementation to strip fragments if need be. GenericURLNormalizer is now always invoked by default, and its default set of rules removes fragments.

The URL specification says hashtags are used to represent fragments only, that is, to quickly jump to a specific section of the page the URL represents. Under normal circumstances, keeping URL fragments usually leads to duplicate documents being fetched (same URL but different fragment), so they should be stripped. Unfortunately, some sites do not follow the URL standard and use hashtags as a regular part of a URL (i.e. different hashtags point to different web pages). It may be essential to keep the URL fragments when crawling these sites. This can be done by making sure the URL normalizer does not strip them.
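The trade-off described above can be shown with a few lines of JDK-only Java. `FragmentHandling` is a hypothetical helper, not the normalizer's implementation, and the example.com URLs are made up.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class FragmentHandling {

    /** Removes the first '#' and everything after it, if present. */
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    /** Returns the distinct URLs left once fragments are kept or stripped. */
    static Set<String> distinct(List<String> urls, boolean keepFragments) {
        Set<String> result = new LinkedHashSet<>();
        for (String url : urls) {
            result.add(keepFragments ? url : stripFragment(url));
        }
        return result;
    }
}
```

Stripping collapses `page.html#intro` and `page.html#usage` into one document, which is usually desired. On a site that routes with hashtags, stripping would also collapse `http://example.com/#/products` and `http://example.com/#/contact` into the same URL, which is why fragments may need to be kept there.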

Ignoring link data

By default, contextual information is kept about the HTML/XML mark-up tag from which a link is extracted (e.g., tag name and attributes). That information gets stored as metadata in the target document. If you want to limit the quantity of information extracted/stored, you can disable this feature by setting ignoreLinkData to true.

URL Schemes

Since 2.4.0, only valid schemes are extracted for absolute URLs. By default, those are http, https, and ftp. You can specify your own list of supported schemes with setSchemes(String...).
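A scheme filter along these lines can be sketched with the JDK alone. `SchemeFilter` is hypothetical; letting relative URLs through unchecked is an assumption drawn from the statement that the restriction applies to absolute URLs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SchemeFilter {

    // The documented defaults.
    static final Set<String> DEFAULT_SCHEMES =
            new HashSet<>(Arrays.asList("http", "https", "ftp"));

    // A leading scheme: a letter followed by letters, digits, '+', '.', or '-'.
    private static final Pattern SCHEME =
            Pattern.compile("^([a-zA-Z][a-zA-Z0-9+.-]*):");

    static boolean isAccepted(String url, Set<String> schemes) {
        Matcher m = SCHEME.matcher(url.trim());
        if (!m.find()) {
            return true; // Relative URL: no scheme to validate (assumption).
        }
        return schemes.contains(m.group(1).toLowerCase(Locale.ROOT));
    }
}
```

With the defaults, `mailto:` and `javascript:` links are rejected while `https://example.com` and `page.html` pass.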

HTML/XML Comments

Since 2.6.0, URLs found in HTML/XML comments (<!-- ... -->) are no longer extracted by default. To enable URL extraction from comments, use setCommentsEnabled(boolean).
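One simple way to picture the comments setting is as a pre-processing step that drops comment blocks before links are searched. The JDK-only sketch below takes that approach; `CommentStripper` is hypothetical and the real extractor may handle comments differently.

```java
import java.util.regex.Pattern;

class CommentStripper {

    // DOTALL lets a comment span several lines.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    /** Returns the markup to scan: unchanged if comments are enabled. */
    static String prepare(String markup, boolean commentsEnabled) {
        return commentsEnabled
                ? markup
                : COMMENT.matcher(markup).replaceAll("");
    }
}
```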

Extract links in certain parts only

Since 2.8.0, you can identify portions of a document where links should be extracted or ignored with setExtractBetweens(RegexPair...) and setNoExtractBetweens(RegexPair...). Eligible content for link extraction is identified first, and exclusions are then applied to that subset.
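The two-step order (include first, then exclude within the included subset) can be sketched as follows with the JDK regex classes. `BetweenFilter` and its `between` helper are hypothetical stand-ins for the RegexPair mechanism, and keeping the delimiters in the eligible text is an assumption of this sketch.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenFilter {

    /** Builds a pattern spanning from a start regex to the nearest end regex. */
    static Pattern between(String start, String end, boolean caseSensitive) {
        int flags = Pattern.DOTALL
                | (caseSensitive ? 0 : Pattern.CASE_INSENSITIVE);
        return Pattern.compile(
                "(?:" + start + ").*?(?:" + end + ")", flags);
    }

    /** Step 1: keep only "extract" spans. Step 2: cut "noExtract" spans out. */
    static String eligible(String content,
            List<Pattern> extract, List<Pattern> noExtract) {
        String subset = content;
        if (!extract.isEmpty()) {
            StringBuilder kept = new StringBuilder();
            for (Pattern p : extract) {
                Matcher m = p.matcher(content);
                while (m.find()) {
                    kept.append(m.group()).append('\n');
                }
            }
            subset = kept.toString();
        }
        for (Pattern p : noExtract) {
            subset = p.matcher(subset).replaceAll("");
        }
        return subset;
    }
}
```

Link extraction would then run on the returned text only, so a link inside an excluded span that sits within an included span is never seen.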

Since 2.9.0, you can further limit link extraction to specific areas using selector syntax, with setExtractSelectors(String...) and setNoExtractSelectors(String...).

XML configuration usage:
```
<extractor
    class="com.norconex.collector.http.link.impl.HtmlLinkExtractor"
    maxURLLength="(maximum URL length. Default is 2048)"
    ignoreNofollow="[false|true]"
    ignoreLinkData="[false|true]"
    commentsEnabled="[false|true]"
    charset="(supported character encoding)">
  <fieldMatcher>
    (optional expression for fields used for links extraction instead
     of the document stream)
  </fieldMatcher>
  <schemes>
    (CSV list of URI schemes for which to perform link extraction.
     Leave blank or remove tag to use defaults.)
  </schemes>
  
  <tags>
    <tag
        name="(tag name)"
        attribute="(tag attribute)"/>
    
  </tags>
  
  <extractBetween
      caseSensitive="[false|true]">
    <start>(regex)</start>
    <end>(regex)</end>
  </extractBetween>
  
  
  <noExtractBetween
      caseSensitive="[false|true]">
    <start>(regex)</start>
    <end>(regex)</end>
  </noExtractBetween>
  
  
  <extractSelector>(selector)</extractSelector>
  
  
  <noExtractSelector>(selector)</noExtractSelector>
  
</extractor>
```
XML usage example:
```
<extractor
    class="com.norconex.collector.http.link.impl.HtmlLinkExtractor">
  <tags>
    <tag
        name="a"
        attribute="href"/>
    <tag
        name="frame"
        attribute="src"/>
    <tag
        name="iframe"
        attribute="src"/>
    <tag
        name="img"
        attribute="src"/>
    <tag
        name="meta"
        attribute="http-equiv"/>
    <tag
        name="script"
        attribute="src"/>
  </tags>
</extractor>
```
The above example adds script.src to the default tags so that URLs of JavaScript files are also extracted.

Since:

3.0.0 (refactored from GenericLinkExtractor)

Author:

Pascal Essiembre

Nested Class Summary

Nested Classes
Modifier and Type	Class	Description
`static class`	`HtmlLinkExtractor.RegexPair`

Field Summary

Fields
Modifier and Type	Field	Description
`static int`	`DEFAULT_MAX_URL_LENGTH`	Default maximum length a URL can have.
`static int`	`MAX_BUFFER_SIZE`
`static int`	`OVERLAP_SIZE`

Constructor Summary

Constructors
Constructor	Description
`HtmlLinkExtractor()`

Method Summary

Modifier and Type	Method	Description
`void`	`addExtractBetween(String start, String end, boolean caseSensitive)`	Adds patterns delimiting a portion of a document to be considered for link extraction.
`void`	`addExtractSelectors(String... selectors)`	Adds selectors matching the portions of a document to be considered for link extraction.
`void`	`addExtractSelectors(List<String> selectors)`	Adds selectors matching the portions of a document to be considered for link extraction.
`void`	`addLinkTag(String tagName, String attribute)`
`void`	`addNoExtractBetween(String start, String end, boolean caseSensitive)`	Adds patterns delimiting a portion of a document to be excluded from link extraction.
`void`	`addNoExtractSelectors(String... selectors)`	Adds selectors matching the portions of a document to be excluded from link extraction.
`void`	`addNoExtractSelectors(List<String> selectors)`	Adds selectors matching the portions of a document to be excluded from link extraction.
`void`	`clearLinkTags()`
`boolean`	`equals(Object other)`
`void`	`extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader)`
`String`	`getCharset()`	Gets the character set of pages on which link extraction is performed.
`List<HtmlLinkExtractor.RegexPair>`	`getExtractBetweens()`	Gets the patterns delimiting the portions of a document to be considered for link extraction.
`List<String>`	`getExtractSelectors()`	Gets the selectors matching the portions of a document to be considered for link extraction.
`int`	`getMaxURLLength()`	Gets the maximum supported URL length.
`List<HtmlLinkExtractor.RegexPair>`	`getNoExtractBetweens()`	Gets the patterns delimiting the portions of a document to be excluded from link extraction.
`List<String>`	`getNoExtractSelectors()`	Gets the selectors matching the portions of a document to be excluded from link extraction.
`List<String>`	`getSchemes()`	Gets the schemes to be extracted.
`int`	`hashCode()`
`boolean`	`isCommentsEnabled()`	Gets whether links should be extracted from HTML/XML comments.
`boolean`	`isIgnoreLinkData()`	Gets whether to ignore extra data associated with a link.
`boolean`	`isIgnoreNofollow()`
`protected void`	`loadTextLinkExtractorFromXML(XML xml)`	Loads configuration settings specific to the implementing class.
`void`	`removeLinkTag(String tagName, String attribute)`
`protected void`	`saveTextLinkExtractorToXML(XML xml)`	Saves configuration settings specific to the implementing class.
`void`	`setCharset(String charset)`	Sets the character set of pages on which link extraction is performed.
`void`	`setCommentsEnabled(boolean commentsEnabled)`	Sets whether links should be extracted from HTML/XML comments.
`void`	`setExtractBetweens(HtmlLinkExtractor.RegexPair... betweens)`	Sets the patterns delimiting the portions of a document to be considered for link extraction.
`void`	`setExtractBetweens(List<HtmlLinkExtractor.RegexPair> betweens)`	Sets the patterns delimiting the portions of a document to be considered for link extraction.
`void`	`setExtractSelectors(String... selectors)`	Sets the selectors matching the portions of a document to be considered for link extraction.
`void`	`setExtractSelectors(List<String> selectors)`	Sets the selectors matching the portions of a document to be considered for link extraction.
`void`	`setIgnoreLinkData(boolean ignoreLinkData)`	Sets whether to ignore extra data associated with a link.
`void`	`setIgnoreNofollow(boolean ignoreNofollow)`
`void`	`setMaxURLLength(int maxURLLength)`	Sets the maximum supported URL length.
`void`	`setNoExtractBetweens(HtmlLinkExtractor.RegexPair... betweens)`	Sets the patterns delimiting the portions of a document to be excluded from link extraction.
`void`	`setNoExtractBetweens(List<HtmlLinkExtractor.RegexPair> betweens)`	Sets the patterns delimiting the portions of a document to be excluded from link extraction.
`void`	`setNoExtractSelectors(String... selectors)`	Sets the selectors matching the portions of a document to be excluded from link extraction.
`void`	`setNoExtractSelectors(List<String> selectors)`	Sets the selectors matching the portions of a document to be excluded from link extraction.
`void`	`setSchemes(String... schemes)`	Sets the schemes to be extracted.
`void`	`setSchemes(List<String> schemes)`	Sets the schemes to be extracted.
`String`	`toString()`

Methods inherited from class com.norconex.collector.http.link.AbstractTextLinkExtractor
extractLinks, getFieldMatcher, loadLinkExtractorFromXML, saveLinkExtractorToXML, setFieldMatcher

Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Field Detail
  - MAX_BUFFER_SIZE
```
public static final int MAX_BUFFER_SIZE
```
    See Also:
    
    Constant Field Values
  - OVERLAP_SIZE
```
public static final int OVERLAP_SIZE
```
    See Also:
    
    Constant Field Values
  - DEFAULT_MAX_URL_LENGTH
```
public static final int DEFAULT_MAX_URL_LENGTH
```
    Default maximum length a URL can have.
    
    See Also:
    
    Constant Field Values
- Constructor Detail
  - HtmlLinkExtractor
```
public HtmlLinkExtractor()
```
- Method Detail
  - extractTextLinks
```
public void extractTextLinks(Set<Link> links,
                             HandlerDoc doc,
                             Reader reader)
                      throws IOException
```
    Specified by:
    
    extractTextLinks in class AbstractTextLinkExtractor
    
    Throws:
    
    IOException
  - getMaxURLLength
```
public int getMaxURLLength()
```
    Gets the maximum supported URL length.
    
    Returns:
    
    maximum URL length
  - setMaxURLLength
```
public void setMaxURLLength(int maxURLLength)
```
    Sets the maximum supported URL length.
    
    Parameters:
    
    maxURLLength - maximum URL length
  - getExtractBetweens
```
public List<HtmlLinkExtractor.RegexPair> getExtractBetweens()
```
    Gets the patterns delimiting the portions of a document to be considered for link extraction.
    
    Returns:
    
    extract between patterns
    
    Since:
    
    2.8.0
  - setExtractBetweens
```
public void setExtractBetweens(HtmlLinkExtractor.RegexPair... betweens)
```
    Sets the patterns delimiting the portions of a document to be considered for link extraction.
    
    Parameters:
    
    betweens - extract between patterns
    
    Since:
    
    2.8.0
  - setExtractBetweens
```
public void setExtractBetweens(List<HtmlLinkExtractor.RegexPair> betweens)
```
    Sets the patterns delimiting the portions of a document to be considered for link extraction.
    
    Parameters:
    
    betweens - extract between patterns
    
    Since:
    
    3.0.0
  - addExtractBetween
```
public void addExtractBetween(String start,
                              String end,
                              boolean caseSensitive)
```
    Adds patterns delimiting a portion of a document to be considered for link extraction.
    
    Parameters:
    
    start - pattern matching start of text portion
    
    end - pattern matching end of text portion
    
    caseSensitive - whether the patterns are case sensitive or not
    
    Since:
    
    2.8.0
  - getNoExtractBetweens
```
public List<HtmlLinkExtractor.RegexPair> getNoExtractBetweens()
```
    Gets the patterns delimiting the portions of a document to be excluded from link extraction.
    
    Returns:
    
    no-extract between patterns
    
    Since:
    
    2.8.0
  - setNoExtractBetweens
```
public void setNoExtractBetweens(HtmlLinkExtractor.RegexPair... betweens)
```
    Sets the patterns delimiting the portions of a document to be excluded from link extraction.
    
    Parameters:
    
    betweens - no-extract between patterns
    
    Since:
    
    2.8.0
  - setNoExtractBetweens
```
public void setNoExtractBetweens(List<HtmlLinkExtractor.RegexPair> betweens)
```
    Sets the patterns delimiting the portions of a document to be excluded from link extraction.
    
    Parameters:
    
    betweens - no-extract between patterns
    
    Since:
    
    3.0.0
  - addNoExtractBetween
```
public void addNoExtractBetween(String start,
                                String end,
                                boolean caseSensitive)
```
    Adds patterns delimiting a portion of a document to be excluded from link extraction.
    
    Parameters:
    
    start - pattern matching start of text portion
    
    end - pattern matching end of text portion
    
    caseSensitive - whether the patterns are case sensitive or not
    
    Since:
    
    2.8.0
  - getExtractSelectors
```
public List<String> getExtractSelectors()
```
    Gets the selectors matching the portions of a document to be considered for link extraction.
    
    Returns:
    
    selectors
    
    Since:
    
    2.9.0
  - setExtractSelectors
```
public void setExtractSelectors(String... selectors)
```
    Sets the selectors matching the portions of a document to be considered for link extraction.
    
    Parameters:
    
    selectors - selectors
    
    Since:
    
    2.9.0
  - setExtractSelectors
```
public void setExtractSelectors(List<String> selectors)
```
    Sets the selectors matching the portions of a document to be considered for link extraction.
    
    Parameters:
    
    selectors - selectors
    
    Since:
    
    3.0.0
  - addExtractSelectors
```
public void addExtractSelectors(String... selectors)
```
    Adds selectors matching the portions of a document to be considered for link extraction.
    
    Parameters:
    
    selectors - selectors
    
    Since:
    
    2.9.0
  - addExtractSelectors
```
public void addExtractSelectors(List<String> selectors)
```
    Adds selectors matching the portions of a document to be considered for link extraction.
    
    Parameters:
    
    selectors - selectors
    
    Since:
    
    3.0.0
  - getNoExtractSelectors
```
public List<String> getNoExtractSelectors()
```
    Gets the selectors matching the portions of a document to be excluded from link extraction.
    
    Returns:
    
    selectors
    
    Since:
    
    2.9.0
  - setNoExtractSelectors
```
public void setNoExtractSelectors(String... selectors)
```
    Sets the selectors matching the portions of a document to be excluded from link extraction.
    
    Parameters:
    
    selectors - selectors
    
    Since:
    
    2.9.0
  - setNoExtractSelectors
```
public void setNoExtractSelectors(List<String> selectors)
```
    Sets the selectors matching the portions of a document to be excluded from link extraction.
    
    Parameters:
    
    selectors - selectors
    
    Since:
    
    3.0.0
  - addNoExtractSelectors
```
public void addNoExtractSelectors(String... selectors)
```
    Adds selectors matching the portions of a document to be excluded from link extraction.
    
    Parameters:
    
    selectors - selectors
    
    Since:
    
    2.9.0
  - addNoExtractSelectors
```
public void addNoExtractSelectors(List<String> selectors)
```
    Adds selectors matching the portions of a document to be excluded from link extraction.
    
    Parameters:
    
    selectors - selectors
    
    Since:
    
    3.0.0
  - isCommentsEnabled
```
public boolean isCommentsEnabled()
```
    Gets whether links should be extracted from HTML/XML comments.
    
    Returns:
    
    true if links should be extracted from comments.
    
    Since:
    
    2.6.0
  - setCommentsEnabled
```
public void setCommentsEnabled(boolean commentsEnabled)
```
    Sets whether links should be extracted from HTML/XML comments.
    
    Parameters:
    
    commentsEnabled - true if links should be extracted from comments.
    
    Since:
    
    2.6.0
  - getSchemes
```
public List<String> getSchemes()
```
    Gets the schemes to be extracted.
    
    Returns:
    
    schemes to be extracted
    
    Since:
    
    2.4.0
  - setSchemes
```
public void setSchemes(String... schemes)
```
    Sets the schemes to be extracted.
    
    Parameters:
    
    schemes - schemes to be extracted
    
    Since:
    
    2.4.0
  - setSchemes
```
public void setSchemes(List<String> schemes)
```
    Sets the schemes to be extracted.
    
    Parameters:
    
    schemes - schemes to be extracted
    
    Since:
    
    3.0.0
  - isIgnoreNofollow
```
public boolean isIgnoreNofollow()
```
  - setIgnoreNofollow
```
public void setIgnoreNofollow(boolean ignoreNofollow)
```
  - isIgnoreLinkData
```
public boolean isIgnoreLinkData()
```
    Gets whether to ignore extra data associated with a link.
    
    Returns:
    
    true to ignore.
    
    Since:
    
    3.0.0
  - setIgnoreLinkData
```
public void setIgnoreLinkData(boolean ignoreLinkData)
```
    Sets whether to ignore extra data associated with a link.
    
    Parameters:
    
    ignoreLinkData - true to ignore.
    
    Since:
    
    3.0.0
  - getCharset
```
public String getCharset()
```
    Gets the character set of pages on which link extraction is performed. Default is null (charset detection will be attempted).
    
    Returns:
    
    character set to use, or null
    
    Since:
    
    2.4.0
  - setCharset
```
public void setCharset(String charset)
```
    Sets the character set of pages on which link extraction is performed. When none is specified (null), charset detection is attempted.
    
    Parameters:
    
    charset - character set to use, or null
    
    Since:
    
    2.4.0
  - addLinkTag
```
public void addLinkTag(String tagName,
                       String attribute)
```
  - removeLinkTag
```
public void removeLinkTag(String tagName,
                          String attribute)
```
  - clearLinkTags
```
public void clearLinkTags()
```
  - loadTextLinkExtractorFromXML
```
protected void loadTextLinkExtractorFromXML(XML xml)
```
    Description copied from class: AbstractTextLinkExtractor
    
    Loads configuration settings specific to the implementing class.
    
    Specified by:
    
    loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
    
    Parameters:
    
    xml - XML configuration
  - saveTextLinkExtractorToXML
```
protected void saveTextLinkExtractorToXML(XML xml)
```
    Description copied from class: AbstractTextLinkExtractor
    
    Saves configuration settings specific to the implementing class.
    
    Specified by:
    
    saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
    
    Parameters:
    
    xml - the XML
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class AbstractTextLinkExtractor
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class AbstractTextLinkExtractor
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class AbstractTextLinkExtractor

