Class TikaLinkExtractor
- java.lang.Object
  - com.norconex.collector.http.link.AbstractLinkExtractor
    - com.norconex.collector.http.link.impl.TikaLinkExtractor
- All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable
public class TikaLinkExtractor extends AbstractLinkExtractor
Implementation of ILinkExtractor using Apache Tika to perform URL extractions from HTML documents. This is an alternative to the HtmlLinkExtractor.
The configuration of content types, storing of referrer data, ignoring of "nofollow", and ignoring of link data are the same as in HtmlLinkExtractor. For link data, this parser only keeps a predefined set of link attributes, when available (title, type, uri, text, rel).
XML configuration usage:
<extractor class="com.norconex.collector.http.link.impl.TikaLinkExtractor" ignoreNofollow="[false|true]"/>
- Author:
- Pascal Essiembre
- See Also:
HtmlLinkExtractor
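The link data rule described above (keep only title, type, uri, text and rel, when available) can be pictured with a small stand-alone sketch. This is not the extractor's actual code: the class `LinkDataSketch` and its method `keepLinkData` are hypothetical names used for illustration only, and treating an empty value as "not available" is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; not part of the Norconex API.
final class LinkDataSketch {

    // Attribute names the class description lists as retained.
    static final List<String> KEPT =
            List.of("title", "type", "uri", "text", "rel");

    // Returns the subset of rawAttributes that would be kept as link data.
    static Map<String, String> keepLinkData(
            Map<String, String> rawAttributes, boolean ignoreLinkData) {
        Map<String, String> kept = new LinkedHashMap<>();
        if (ignoreLinkData) {
            // Extra data associated with the link is dropped entirely.
            return kept;
        }
        for (String name : KEPT) {
            String value = rawAttributes.get(name);
            // Assumption: a missing or empty value counts as "not available".
            if (value != null && !value.isEmpty()) {
                kept.put(name, value);
            }
        }
        return kept;
    }
}
```

Under these assumptions, an attribute such as `class` is discarded, and passing `true` for `ignoreLinkData` yields an empty map.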
Constructor Summary
Constructors
- TikaLinkExtractor()
-
Method Summary
Methods
- boolean equals(Object other)
- void extractLinks(Set<Link> nxLinks, CrawlDoc doc)
- int hashCode()
- boolean isIgnoreLinkData(): Gets whether to ignore extra data associated with a link.
- boolean isIgnoreNofollow()
- protected void loadLinkExtractorFromXML(XML xml): Loads configuration settings specific to the implementing class.
- protected void saveLinkExtractorToXML(XML xml): Saves configuration settings specific to the implementing class.
- void setIgnoreLinkData(boolean ignoreLinkData): Sets whether to ignore extra data associated with a link.
- void setIgnoreNofollow(boolean ignoreNofollow)
- String toString()
Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
Method Detail
-
extractLinks
public void extractLinks(Set<Link> nxLinks, CrawlDoc doc) throws IOException
- Specified by:
extractLinks in class AbstractLinkExtractor
- Throws:
IOException
-
isIgnoreNofollow
public boolean isIgnoreNofollow()
-
setIgnoreNofollow
public void setIgnoreNofollow(boolean ignoreNofollow)
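The flag carries no description here. As a rough sketch only, and assuming the semantics match those documented for HtmlLinkExtractor (a link whose rel attribute contains "nofollow" is skipped unless the flag is true), the decision could look like the following. `NofollowSketch`, `hasNofollow` and `shouldExtract` are hypothetical names, not part of the Norconex API.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical stand-in for illustration; not part of the Norconex API.
final class NofollowSketch {

    // True when the space-separated rel value contains the "nofollow" token.
    static boolean hasNofollow(String rel) {
        if (rel == null) {
            return false;
        }
        return Arrays.stream(rel.trim().toLowerCase(Locale.ROOT).split("\\s+"))
                .anyMatch("nofollow"::equals);
    }

    // Assumed rule: extract the link unless it is marked nofollow
    // and the nofollow directive is being honored (ignoreNofollow == false).
    static boolean shouldExtract(String rel, boolean ignoreNofollow) {
        return ignoreNofollow || !hasNofollow(rel);
    }
}
```

With this reading, `ignoreNofollow="true"` in the XML configuration means the directive itself is ignored, so links marked nofollow are still extracted.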
-
isIgnoreLinkData
public boolean isIgnoreLinkData()
Gets whether to ignore extra data associated with a link.
- Returns:
true to ignore.
- Since:
- 3.0.0
-
setIgnoreLinkData
public void setIgnoreLinkData(boolean ignoreLinkData)
Sets whether to ignore extra data associated with a link.
- Parameters:
ignoreLinkData - true to ignore.
- Since:
- 3.0.0
-
loadLinkExtractorFromXML
protected void loadLinkExtractorFromXML(XML xml)
Description copied from class: AbstractLinkExtractor
Loads configuration settings specific to the implementing class.
- Specified by:
loadLinkExtractorFromXML in class AbstractLinkExtractor
- Parameters:
xml - XML configuration
-
saveLinkExtractorToXML
protected void saveLinkExtractorToXML(XML xml)
Description copied from class: AbstractLinkExtractor
Saves configuration settings specific to the implementing class.
- Specified by:
saveLinkExtractorToXML in class AbstractLinkExtractor
- Parameters:
xml - the XML
-
equals
public boolean equals(Object other)
- Overrides:
equals in class AbstractLinkExtractor
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractLinkExtractor
-
toString
public String toString()
- Overrides:
toString in class AbstractLinkExtractor