Class TikaLinkExtractor
java.lang.Object
com.norconex.collector.http.link.AbstractLinkExtractor
com.norconex.collector.http.link.impl.TikaLinkExtractor
- All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable
An implementation of ILinkExtractor that uses Apache Tika to extract
URLs from HTML documents. It is an alternative to HtmlLinkExtractor.
Content types, referrer data storage, "nofollow" handling, and link data
handling are configured the same way as in HtmlLinkExtractor. For link
data, this parser keeps only a predefined set of link attributes, when
available: title, type, uri, text, and rel.
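The attribute filtering described above can be pictured with a small, library-free sketch. It is not the Norconex implementation: the class name LinkDataFilterSketch, the method keepKnownAttributes, and the use of a plain Map to hold link attributes are all assumptions made for illustration. Only the five attribute names come from this page.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical stand-in, not the Norconex code.
final class LinkDataFilterSketch {

    // Attribute names listed in the class description.
    static final List<String> KEPT_ATTRIBUTES =
            List.of("title", "type", "uri", "text", "rel");

    // Returns a new map holding only the kept attribute names that are
    // present with a non-empty value (one reading of "when available").
    static Map<String, String> keepKnownAttributes(
            Map<String, String> rawAttributes) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (String name : KEPT_ATTRIBUTES) {
            String value = rawAttributes.get(name);
            if (value != null && !value.isEmpty()) {
                kept.put(name, value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("uri", "https://example.com/page.html");
        raw.put("title", "Example page");
        raw.put("target", "_blank"); // not in the kept set, so dropped
        raw.put("rel", "");          // empty, so treated as unavailable
        System.out.println(keepKnownAttributes(raw));
    }
}
```

Treating an empty value as unavailable is a choice made for this sketch; the extractor itself may handle empty attributes differently.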
XML configuration usage:
<extractor
    class="com.norconex.collector.http.link.impl.TikaLinkExtractor"
    ignoreNofollow="[false|true]"/>
- Author:
- Pascal Essiembre
Constructor Summary
Constructors
TikaLinkExtractor()
Method Summary
Modifier and Type   Method                                          Description
boolean             equals
void                extractLinks(Set<Link> nxLinks, CrawlDoc doc)
int                 hashCode()
boolean             isIgnoreLinkData()                              Gets whether to ignore extra data associated with a link.
boolean             isIgnoreNofollow()
protected void      loadLinkExtractorFromXML                        Loads configuration settings specific to the implementing class.
protected void      saveLinkExtractorToXML                          Saves configuration settings specific to the implementing class.
void                setIgnoreLinkData(boolean ignoreLinkData)       Sets whether to ignore extra data associated with a link.
void                setIgnoreNofollow(boolean ignoreNofollow)
String              toString()
Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
Constructor Details
TikaLinkExtractor
public TikaLinkExtractor()
Method Details
extractLinks
public void extractLinks(Set<Link> nxLinks, CrawlDoc doc) throws IOException
- Specified by:
extractLinks in class AbstractLinkExtractor
- Throws:
IOException
isIgnoreNofollow
public boolean isIgnoreNofollow()
setIgnoreNofollow
public void setIgnoreNofollow(boolean ignoreNofollow)
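This page does not spell out what the ignoreNofollow flag does beyond saying it matches HtmlLinkExtractor. As a rough, library-free sketch, assuming the flag means "extract links even when their rel attribute says nofollow", the decision could look like the following. The class NofollowSketch and the method shouldExtract are hypothetical names, and the token-based, case-insensitive matching is an assumption; the real matching rules may differ.

```java
import java.util.Locale;

// Illustrative only: a hypothetical stand-in, not the Norconex code.
final class NofollowSketch {

    // Decides whether a link would be extracted, given its rel attribute
    // value (possibly null) and the ignoreNofollow setting.
    static boolean shouldExtract(String rel, boolean ignoreNofollow) {
        if (ignoreNofollow || rel == null) {
            return true; // flag set, or nothing to honor
        }
        // rel may carry several space-separated tokens.
        String[] tokens =
                rel.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (String token : tokens) {
            if (token.equals("nofollow")) {
                return false; // honor nofollow: skip this link
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldExtract("nofollow", false));
        System.out.println(shouldExtract("nofollow", true));
    }
}
```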
isIgnoreLinkData
public boolean isIgnoreLinkData()
Gets whether to ignore extra data associated with a link.
- Returns:
true to ignore.
- Since:
- 3.0.0
setIgnoreLinkData
public void setIgnoreLinkData(boolean ignoreLinkData)
Sets whether to ignore extra data associated with a link.
- Parameters:
ignoreLinkData - true to ignore.
- Since:
- 3.0.0
loadLinkExtractorFromXML
Description copied from class: AbstractLinkExtractor
Loads configuration settings specific to the implementing class.
- Specified by:
loadLinkExtractorFromXML in class AbstractLinkExtractor
- Parameters:
xml - XML configuration
saveLinkExtractorToXML
Description copied from class: AbstractLinkExtractor
Saves configuration settings specific to the implementing class.
- Specified by:
saveLinkExtractorToXML in class AbstractLinkExtractor
- Parameters:
xml - the XML
equals
- Overrides:
equals in class AbstractLinkExtractor
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractLinkExtractor
toString
- Overrides:
toString in class AbstractLinkExtractor