Package com.norconex.collector.http.link
Class AbstractTextLinkExtractor
- java.lang.Object
  - com.norconex.collector.http.link.AbstractLinkExtractor
    - com.norconex.collector.http.link.AbstractTextLinkExtractor
- All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable
- Direct Known Subclasses:
DOMLinkExtractor, HtmlLinkExtractor, RegexLinkExtractor, XMLFeedLinkExtractor
public abstract class AbstractTextLinkExtractor extends AbstractLinkExtractor
Base class for link extraction from text documents. It provides common configuration settings, such as restricting extraction to specific documents only and specifying one or more metadata fields from which to read the text used for extracting links.
Not suitable for binary files.
Subclasses inherit the following:
XML configuration usage:
<fieldMatcher> (optional expression for fields used for links extraction instead of the document stream) </fieldMatcher>
XML usage example:
The above will apply to any content type starting with "text/".
- Since:
- 3.0.0
- Author:
- Pascal Essiembre
Constructor Summary
Constructors
- AbstractTextLinkExtractor()
-
Method Summary
Modifier and Type, Method, Description:
- boolean equals(Object other)
- void extractLinks(Set<Link> links, CrawlDoc doc)
- abstract void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader)
- TextMatcher getFieldMatcher(): Gets field matcher identifying fields holding content used for link extraction.
- int hashCode()
- void loadLinkExtractorFromXML(XML xml): Loads configuration settings specific to the implementing class.
- protected abstract void loadTextLinkExtractorFromXML(XML xml): Loads configuration settings specific to the implementing class.
- protected void saveLinkExtractorToXML(XML xml): Saves configuration settings specific to the implementing class.
- protected abstract void saveTextLinkExtractorToXML(XML xml): Saves configuration settings specific to the implementing class.
- void setFieldMatcher(TextMatcher fieldMatcher): Sets field matcher identifying fields holding content used for link extraction.
- String toString()
Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
Method Detail
-
extractLinks
public final void extractLinks(Set<Link> links, CrawlDoc doc) throws IOException
- Specified by:
extractLinks in class AbstractLinkExtractor
- Throws:
IOException
-
extractTextLinks
public abstract void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
- Throws:
IOException
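The split between the final extractLinks and the abstract extractTextLinks reads like a template method: the base class obtains the text and passes a Reader to the subclass. The following is a minimal, self-contained sketch of that idea only. It does not use the Norconex classes: TextExtractorSketch, UrlRegexSketch, and ExtractTextLinksDemo are hypothetical stand-ins, Set<String> replaces Set<Link>, and a plain String replaces CrawlDoc/HandlerDoc.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the base class (assumption, not the Norconex code).
abstract class TextExtractorSketch {
    // Final entry point: opens a Reader over the text, then delegates.
    public final void extractLinks(Set<String> links, String text) throws IOException {
        try (Reader reader = new StringReader(text)) {
            extractTextLinks(links, reader);
        }
    }

    // Subclasses only handle character data supplied by the base class.
    public abstract void extractTextLinks(Set<String> links, Reader reader)
            throws IOException;
}

// Hypothetical subclass collecting absolute http(s) URLs, line by line.
class UrlRegexSketch extends TextExtractorSketch {
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    @Override
    public void extractTextLinks(Set<String> links, Reader reader) throws IOException {
        BufferedReader br = new BufferedReader(reader);
        String line;
        while ((line = br.readLine()) != null) {
            Matcher m = URL.matcher(line);
            while (m.find()) {
                links.add(m.group());
            }
        }
    }
}

class ExtractTextLinksDemo {
    // Runs the sketch on a piece of text and returns the links in encounter order.
    static Set<String> run(String text) {
        Set<String> links = new LinkedHashSet<>();
        try {
            new UrlRegexSketch().extractLinks(links, text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(run("See https://example.com/a and http://example.org/b"));
    }
}
```

A real subclass would extend AbstractTextLinkExtractor and add Link instances to the supplied set; the regular expression here is deliberately simplistic.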
-
getFieldMatcher
public TextMatcher getFieldMatcher()
Gets field matcher identifying fields holding content used for link extraction. Default is null, using the document content stream instead.
- Returns:
field matcher
-
setFieldMatcher
public void setFieldMatcher(TextMatcher fieldMatcher)
Sets field matcher identifying fields holding content used for link extraction. Default is null, using the document content stream instead.
- Parameters:
fieldMatcher - field matcher
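The effect of the field matcher on where text comes from can be pictured with the small sketch below. It is an illustration under stated assumptions, not the library's implementation: FieldSourceSketch and textsToScan are hypothetical names, a Predicate<String> on field names stands in for TextMatcher, and a Map of field values stands in for document metadata.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in: a Predicate<String> plays the role of TextMatcher.
class FieldSourceSketch {

    // Picks the text to scan for links: the document content when no
    // matcher is set, otherwise the values of all matching metadata fields.
    static List<String> textsToScan(
            Predicate<String> fieldMatcher,
            Map<String, List<String>> metadata,
            String content) {
        List<String> texts = new ArrayList<>();
        if (fieldMatcher == null) {
            texts.add(content);
            return texts;
        }
        for (Map.Entry<String, List<String>> field : metadata.entrySet()) {
            if (fieldMatcher.test(field.getKey())) {
                texts.addAll(field.getValue());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("title", Arrays.asList("Home"));
        metadata.put("related", Arrays.asList("http://example.com/a"));
        // No matcher: the content is scanned.
        System.out.println(textsToScan(null, metadata, "body text"));
        // Matcher set: only values of matching fields are scanned.
        System.out.println(textsToScan("related"::equals, metadata, "body text"));
    }
}
```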
-
loadLinkExtractorFromXML
public final void loadLinkExtractorFromXML(XML xml)
Description copied from class: AbstractLinkExtractor
Loads configuration settings specific to the implementing class.
- Specified by:
loadLinkExtractorFromXML in class AbstractLinkExtractor
- Parameters:
xml - XML configuration
-
loadTextLinkExtractorFromXML
protected abstract void loadTextLinkExtractorFromXML(XML xml)
Loads configuration settings specific to the implementing class.
- Parameters:
xml - XML configuration
-
saveLinkExtractorToXML
protected final void saveLinkExtractorToXML(XML xml)
Description copied from class: AbstractLinkExtractor
Saves configuration settings specific to the implementing class.
- Specified by:
saveLinkExtractorToXML in class AbstractLinkExtractor
- Parameters:
xml - the XML
-
saveTextLinkExtractorToXML
protected abstract void saveTextLinkExtractorToXML(XML xml)
Saves configuration settings specific to the implementing class.
- Parameters:
xml - the XML
-
equals
public boolean equals(Object other)
- Overrides:
equals in class AbstractLinkExtractor
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractLinkExtractor
-
toString
public String toString()
- Overrides:
toString in class AbstractLinkExtractor