Class TikaLinkExtractor
- java.lang.Object
- com.norconex.collector.http.link.AbstractLinkExtractor
- com.norconex.collector.http.link.impl.TikaLinkExtractor
- All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable
public class TikaLinkExtractor extends AbstractLinkExtractor
Implementation of ILinkExtractor using Apache Tika to extract URLs from HTML documents. This is an alternative to the HtmlLinkExtractor.
The configuration of content types, storing of referrer data, ignoring of "nofollow", and ignoring of link data is the same as in HtmlLinkExtractor. For link data, this parser only keeps a pre-defined set of link attributes, when available (title, type, uri, text, rel).
XML configuration usage:
<extractor class="com.norconex.collector.http.link.impl.TikaLinkExtractor" ignoreNofollow="[false|true]"/>
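To illustrate the two flags and the retained attribute set described above, here is a minimal sketch that uses only JDK types. It is not the Norconex or Tika implementation: the class `LinkFlagsSketch`, its methods, and the map-based link representation are hypothetical stand-ins. It also assumes that `ignoreNofollow=true` means links marked `rel="nofollow"` are still extracted, and that `ignoreLinkData=true` means only the URL is kept; check both against the HtmlLinkExtractor documentation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, NOT the Norconex or Tika implementation.
class LinkFlagsSketch {

    // Attribute names listed in the class description.
    static final List<String> KEPT = List.of("title", "type", "uri", "text", "rel");

    // True when the link's rel attribute contains the "nofollow" token.
    static boolean isNofollow(Map<String, String> link) {
        String rel = link.getOrDefault("rel", "");
        for (String token : rel.trim().split("\\s+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return true;
            }
        }
        return false;
    }

    // Assumed semantics: ignoreNofollow=true keeps nofollow links;
    // ignoreLinkData=true keeps only the URL.
    static List<Map<String, String>> extract(
            List<Map<String, String>> rawLinks,
            boolean ignoreNofollow,
            boolean ignoreLinkData) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> raw : rawLinks) {
            String uri = raw.get("uri");
            if (uri == null || uri.isEmpty()) {
                continue; // nothing to follow
            }
            if (isNofollow(raw) && !ignoreNofollow) {
                continue; // honor the nofollow hint
            }
            Map<String, String> kept = new LinkedHashMap<>();
            kept.put("uri", uri);
            if (!ignoreLinkData) {
                // Copy only the pre-defined attributes that are present.
                for (String name : KEPT) {
                    String value = raw.get(name);
                    if (value != null) {
                        kept.put(name, value);
                    }
                }
            }
            out.add(kept);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> raw = List.of(
            Map.of("uri", "https://example.com/a", "title", "A", "class", "btn"),
            Map.of("uri", "https://example.com/b", "rel", "nofollow"));
        System.out.println(extract(raw, false, false));
        System.out.println(extract(raw, true, true));
    }
}
```

With both flags off, the sketch skips the nofollow link and drops attributes outside the pre-defined set (here, `class`). With both flags on, it returns both links with only their URLs.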
- Author:
- Pascal Essiembre
- See Also:
HtmlLinkExtractor
-
Constructor Summary
TikaLinkExtractor()
-
Method Summary
boolean equals(Object other)
void extractLinks(Set<Link> nxLinks, CrawlDoc doc)
int hashCode()
boolean isIgnoreLinkData()
Gets whether to ignore extra data associated with a link.
boolean isIgnoreNofollow()
protected void loadLinkExtractorFromXML(XML xml)
Loads configuration settings specific to the implementing class.
protected void saveLinkExtractorToXML(XML xml)
Saves configuration settings specific to the implementing class.
void setIgnoreLinkData(boolean ignoreLinkData)
Sets whether to ignore extra data associated with a link.
void setIgnoreNofollow(boolean ignoreNofollow)
String toString()
-
Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
-
Method Detail
-
extractLinks
public void extractLinks(Set<Link> nxLinks, CrawlDoc doc) throws IOException
- Specified by:
extractLinks in class AbstractLinkExtractor
- Throws:
IOException
-
isIgnoreNofollow
public boolean isIgnoreNofollow()
-
setIgnoreNofollow
public void setIgnoreNofollow(boolean ignoreNofollow)
-
isIgnoreLinkData
public boolean isIgnoreLinkData()
Gets whether to ignore extra data associated with a link.
- Returns:
true to ignore.
- Since:
3.0.0
-
setIgnoreLinkData
public void setIgnoreLinkData(boolean ignoreLinkData)
Sets whether to ignore extra data associated with a link.
- Parameters:
ignoreLinkData - true to ignore.
- Since:
3.0.0
-
loadLinkExtractorFromXML
protected void loadLinkExtractorFromXML(XML xml)
Description copied from class: AbstractLinkExtractor
Loads configuration settings specific to the implementing class.
- Specified by:
loadLinkExtractorFromXML in class AbstractLinkExtractor
- Parameters:
xml - XML configuration
-
saveLinkExtractorToXML
protected void saveLinkExtractorToXML(XML xml)
Description copied from class: AbstractLinkExtractor
Saves configuration settings specific to the implementing class.
- Specified by:
saveLinkExtractorToXML in class AbstractLinkExtractor
- Parameters:
xml - the XML
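As a rough illustration of what loading and saving the class-specific setting involves, the sketch below round-trips the `ignoreNofollow` attribute from the XML usage line with the JDK DOM parser. It does not use the Norconex `XML` class, whose API is not shown on this page. The class `ExtractorXmlSketch` and its methods are hypothetical, and treating a missing attribute as `false` is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Hypothetical stand-in using JDK DOM, NOT the Norconex XML class.
class ExtractorXmlSketch {

    // Reads the ignoreNofollow attribute from an <extractor .../> element.
    static boolean readIgnoreNofollow(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            // getAttribute returns "" when the attribute is absent,
            // and parseBoolean("") is false (assumed default).
            return Boolean.parseBoolean(root.getAttribute("ignoreNofollow"));
        } catch (Exception e) {
            throw new IllegalStateException("Invalid extractor XML", e);
        }
    }

    // Writes the element in the form shown under "XML configuration usage".
    static String write(boolean ignoreNofollow) {
        return "<extractor class=\""
                + "com.norconex.collector.http.link.impl.TikaLinkExtractor"
                + "\" ignoreNofollow=\"" + ignoreNofollow + "\"/>";
    }

    public static void main(String[] args) {
        String xml = write(true);
        System.out.println(xml);
        System.out.println(readIgnoreNofollow(xml));
    }
}
```

In the real class, the `XML` argument passed to `loadLinkExtractorFromXML` and `saveLinkExtractorToXML` takes the place of the raw string handled here.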
-
equals
public boolean equals(Object other)
- Overrides:
equals in class AbstractLinkExtractor
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractLinkExtractor
-
toString
public String toString()
- Overrides:
toString in class AbstractLinkExtractor