public class TikaLinkExtractor extends AbstractLinkExtractor
Implementation of ILinkExtractor using Apache Tika to perform URL extractions from HTML documents. This is an alternative to the HtmlLinkExtractor.
Content-type configuration, storing of referrer data, ignoring "nofollow", and ignoring link data all work the same way as in HtmlLinkExtractor. For link data, this parser keeps only a predefined set of link attributes, when available: title, type, uri, text, and rel.
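The effect of the two flags and of the fixed attribute set can be pictured with a small, self-contained sketch. It does not call Tika or any Norconex class: `LinkDataSketch` and `extract` are hypothetical names, links are modeled as plain maps, and the "nofollow" handling is an assumption (a link whose rel contains "nofollow" is skipped unless ignoreNofollow is true).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for illustration only; NOT the Norconex or Tika API.
class LinkDataSketch {

    // Attribute names the class description lists as kept when available.
    static final Set<String> KEPT = Set.of("title", "type", "uri", "text", "rel");

    // Assumed semantics of the two flags (see the lead-in above).
    boolean ignoreNofollow = false;
    boolean ignoreLinkData = false;

    /** Maps each extracted URL to the link attributes that were kept. */
    Map<String, Map<String, String>> extract(List<Map<String, String>> rawLinks) {
        Map<String, Map<String, String>> out = new LinkedHashMap<>();
        for (Map<String, String> raw : rawLinks) {
            String url = raw.get("uri");
            if (url == null) {
                continue; // nothing to follow
            }
            // rel may hold several space-separated tokens, e.g. "noopener nofollow".
            String rel = raw.getOrDefault("rel", "").toLowerCase(Locale.ROOT).trim();
            boolean nofollow = Arrays.asList(rel.split("\\s+")).contains("nofollow");
            if (nofollow && !ignoreNofollow) {
                continue; // assumed default: nofollow links are not extracted
            }
            Map<String, String> data = new LinkedHashMap<>();
            if (!ignoreLinkData) {
                for (Map.Entry<String, String> e : raw.entrySet()) {
                    if (KEPT.contains(e.getKey())) {
                        data.put(e.getKey(), e.getValue());
                    }
                }
            }
            out.put(url, data);
        }
        return out;
    }
}
```

In this model, an attribute such as `class` or `id` is always dropped, and setting `ignoreLinkData` to true leaves only the URL itself.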
<extractor
    class="com.norconex.collector.http.link.impl.TikaLinkExtractor"
    ignoreNofollow="[false|true]">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
</extractor>
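The "only one needs to match" rule for multiple restrictTo tags can be sketched as a logical OR over field/value matcher pairs. The code below is a simplified model, not the Norconex implementation: `RestrictToSketch`, `Restriction`, and `anyMatches` are hypothetical names, only the regex method with full (non-partial) matching is modeled, and the metadata field name used in the usage note is illustrative. What the real extractor does when no restriction is configured is outside this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical model of <restrictTo> evaluation; NOT the Norconex API.
class RestrictToSketch {

    /** One restrictTo entry: a field-name pattern paired with a value pattern. */
    static final class Restriction {
        final Pattern field;
        final Pattern value;

        Restriction(String fieldRegex, String valueRegex, boolean ignoreCase) {
            int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
            this.field = Pattern.compile(fieldRegex, flags);
            this.value = Pattern.compile(valueRegex, flags);
        }

        /** True if some field whose name matches has at least one matching value. */
        boolean matches(Map<String, List<String>> metadata) {
            for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
                if (!field.matcher(e.getKey()).matches()) {
                    continue;
                }
                for (String v : e.getValue()) {
                    if (value.matcher(v).matches()) {
                        return true;
                    }
                }
            }
            return false;
        }
    }

    /** Logical OR across restrictions: a single match is enough. */
    static boolean anyMatches(List<Restriction> restrictions,
                              Map<String, List<String>> metadata) {
        for (Restriction r : restrictions) {
            if (r.matches(metadata)) {
                return true;
            }
        }
        return false;
    }
}
```

With two restrictions, one expecting `application/pdf` and one expecting `text/.*` on a content-type field, a document whose metadata holds `text/html` passes because the second restriction matches, even though the first does not.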
See also: HtmlLinkExtractor
Constructor and Description |
---|
TikaLinkExtractor() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
void | extractLinks(Set<Link> nxLinks, CrawlDoc doc) |
int | hashCode() |
boolean | isIgnoreLinkData()<br>Gets whether to ignore extra data associated with a link. |
boolean | isIgnoreNofollow() |
protected void | loadLinkExtractorFromXML(XML xml)<br>Loads configuration settings specific to the implementing class. |
protected void | saveLinkExtractorToXML(XML xml)<br>Saves configuration settings specific to the implementing class. |
void | setIgnoreLinkData(boolean ignoreLinkData)<br>Sets whether to ignore extra data associated with a link. |
void | setIgnoreNofollow(boolean ignoreNofollow) |
String | toString() |
Methods inherited from class AbstractLinkExtractor: addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
public void extractLinks(Set<Link> nxLinks, CrawlDoc doc) throws IOException
extractLinks in class AbstractLinkExtractor
Throws: IOException

public boolean isIgnoreNofollow()

public void setIgnoreNofollow(boolean ignoreNofollow)

public boolean isIgnoreLinkData()
Gets whether to ignore extra data associated with a link.
Returns: true to ignore.

public void setIgnoreLinkData(boolean ignoreLinkData)
Sets whether to ignore extra data associated with a link.
Parameters: ignoreLinkData - true to ignore.

protected void loadLinkExtractorFromXML(XML xml)
Loads configuration settings specific to the implementing class.
loadLinkExtractorFromXML in class AbstractLinkExtractor
Parameters: xml - XML configuration

protected void saveLinkExtractorToXML(XML xml)
Saves configuration settings specific to the implementing class.
saveLinkExtractorToXML in class AbstractLinkExtractor
Parameters: xml - the XML

public boolean equals(Object other)
equals in class AbstractLinkExtractor

public int hashCode()
hashCode in class AbstractLinkExtractor

public String toString()
toString in class AbstractLinkExtractor
Copyright © 2009–2023 Norconex Inc. All rights reserved.