Package com.norconex.collector.http.link
Class AbstractTextLinkExtractor
java.lang.Object
com.norconex.collector.http.link.AbstractLinkExtractor
com.norconex.collector.http.link.AbstractTextLinkExtractor
- All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable
- Direct Known Subclasses:
DOMLinkExtractor, HtmlLinkExtractor, RegexLinkExtractor, XMLFeedLinkExtractor
Base class for link extraction from text documents. It provides common configuration settings, such as restricting extraction to specific documents only and specifying one or more metadata fields from which to take the text used for extracting links.
Not suitable for binary files.
Subclasses inherit the following:
XML configuration usage:
<fieldMatcher>
(optional expression for fields used for links extraction instead
of the document stream)
</fieldMatcher>
XML usage example:
The above will apply to any content type starting with "text/".
- Since:
- 3.0.0
- Author:
- Pascal Essiembre
-
Constructor Summary
Constructors
AbstractTextLinkExtractor()
Method Summary
Modifier and Type          Method                                                               Description
boolean                    equals
final void                 extractLinks(Set<Link> links, CrawlDoc doc)
abstract void              extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader)
TextMatcher                getFieldMatcher()                                                    Gets field matcher identifying fields holding content used for link extraction.
int                        hashCode()
final void                 loadLinkExtractorFromXML                                             Loads configuration settings specific to the implementing class.
protected abstract void    loadTextLinkExtractorFromXML                                         Loads configuration settings specific to the implementing class.
protected final void       saveLinkExtractorToXML                                               Saves configuration settings specific to the implementing class.
protected abstract void    saveTextLinkExtractorToXML                                           Saves configuration settings specific to the implementing class.
void                       setFieldMatcher(TextMatcher fieldMatcher)                            Sets field matcher identifying fields holding content used for link extraction.
String                     toString()
Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
-
Constructor Details
-
AbstractTextLinkExtractor
public AbstractTextLinkExtractor()
-
Method Details
-
extractLinks
- Specified by:
extractLinks in class AbstractLinkExtractor
- Throws:
IOException
-
extractTextLinks
public abstract void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
- Throws:
IOException
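As an illustration of what a subclass might do inside extractTextLinks, the following minimal sketch reads text from a Reader and adds every URL it finds to the supplied set. It is not the Norconex implementation: TextLinkSketch, the URL pattern, and the use of Set<String> in place of Set<Link> are assumptions made so the example runs on the JDK alone, and the HandlerDoc argument is left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, not part of Norconex. It mimics the shape of
// extractTextLinks(links, doc, reader) using plain JDK types only.
final class TextLinkSketch {

    // Simplistic absolute-URL pattern, an assumption for illustration only.
    private static final Pattern URL =
            Pattern.compile("https?://[^\\s\"'<>]+");

    // Reads the text line by line and adds every match to the given set.
    static void extractTextLinks(Set<String> links, Reader reader)
            throws IOException {
        BufferedReader br = new BufferedReader(reader);
        String line;
        while ((line = br.readLine()) != null) {
            Matcher m = URL.matcher(line);
            while (m.find()) {
                links.add(m.group());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Set<String> links = new LinkedHashSet<>();
        extractTextLinks(links, new StringReader(
                "See https://example.com/a and\nhttp://example.org/b?x=1 too."));
        System.out.println(links);
    }
}
```

Passing the set in as a parameter, as the real method signature does, lets a caller accumulate links from several text sources into one collection.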
-
getFieldMatcher
Gets field matcher identifying fields holding content used for link extraction. Default is null, in which case the document content stream is used instead.
- Returns:
- field matcher
-
setFieldMatcher
Sets field matcher identifying fields holding content used for link extraction. Default is null, in which case the document content stream is used instead.
- Parameters:
fieldMatcher - field matcher
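The null default can be pictured with the sketch below: with no field matcher the document content is the only text source, otherwise the values of the matching metadata fields are used. This is a hedged illustration, not the library's code. FieldMatcherSketch and textSources are hypothetical names, a Predicate<String> stands in for TextMatcher, and a Map stands in for the document metadata.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in, not the Norconex implementation. A Predicate<String>
// plays the role of the field matcher and a Map plays the role of metadata.
final class FieldMatcherSketch {

    // Returns the texts to scan for links: the values of matching fields when
    // a matcher is set, otherwise the document content itself.
    static List<String> textSources(Predicate<String> fieldMatcher,
            Map<String, List<String>> metadata, String content) {
        List<String> sources = new ArrayList<>();
        if (fieldMatcher == null) {
            sources.add(content);
            return sources;
        }
        for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
            if (fieldMatcher.test(e.getKey())) {
                sources.addAll(e.getValue());
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        Map<String, List<String>> meta = new LinkedHashMap<>();
        meta.put("description", List.of("Visit https://example.com/a"));
        meta.put("title", List.of("Home"));
        // No matcher: the content is the single text source.
        System.out.println(textSources(null, meta, "body text"));
        // Matcher set: only values of matching fields are used.
        System.out.println(textSources("description"::equals, meta, "body text"));
    }
}
```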
-
loadLinkExtractorFromXML
Description copied from class: AbstractLinkExtractor
Loads configuration settings specific to the implementing class.
- Specified by:
loadLinkExtractorFromXML in class AbstractLinkExtractor
- Parameters:
xml - XML configuration
-
loadTextLinkExtractorFromXML
Loads configuration settings specific to the implementing class.
- Parameters:
xml - XML configuration
-
saveLinkExtractorToXML
Description copied from class: AbstractLinkExtractor
Saves configuration settings specific to the implementing class.
- Specified by:
saveLinkExtractorToXML in class AbstractLinkExtractor
- Parameters:
xml - the XML
-
saveTextLinkExtractorToXML
Saves configuration settings specific to the implementing class.
- Parameters:
xml - the XML
-
equals
- Overrides:
equals in class AbstractLinkExtractor
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractLinkExtractor
-
toString
- Overrides:
toString in class AbstractLinkExtractor
-