Package com.norconex.collector.http.link
Class AbstractTextLinkExtractor
- java.lang.Object
-
- com.norconex.collector.http.link.AbstractLinkExtractor
-
- com.norconex.collector.http.link.AbstractTextLinkExtractor
-
- All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable
- Direct Known Subclasses:
DOMLinkExtractor, HtmlLinkExtractor, RegexLinkExtractor, XMLFeedLinkExtractor
public abstract class AbstractTextLinkExtractor extends AbstractLinkExtractor
Base class for link extraction from text documents. It provides common configuration settings, such as applying extraction to specific documents only and specifying one or more metadata fields from which to read the text used for extracting links.
Not suitable for binary files.
Subclasses inherit the following:
XML configuration usage:
<fieldMatcher> (optional expression matching the fields used for link extraction instead of the document stream) </fieldMatcher>
XML usage example:
The above will apply to any content type starting with "text/".
- Since:
- 3.0.0
- Author:
- Pascal Essiembre
-
-
Constructor Summary
Constructor and Description:
AbstractTextLinkExtractor()
-
Method Summary
Modifier and Type, Method, and Description:
boolean equals(Object other)
void extractLinks(Set<Link> links, CrawlDoc doc)
abstract void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader)
TextMatcher getFieldMatcher()
    Gets field matcher identifying fields holding content used for link extraction.
int hashCode()
void loadLinkExtractorFromXML(XML xml)
    Loads configuration settings specific to the implementing class.
protected abstract void loadTextLinkExtractorFromXML(XML xml)
    Loads configuration settings specific to the implementing class.
protected void saveLinkExtractorToXML(XML xml)
    Saves configuration settings specific to the implementing class.
protected abstract void saveTextLinkExtractorToXML(XML xml)
    Saves configuration settings specific to the implementing class.
void setFieldMatcher(TextMatcher fieldMatcher)
    Sets field matcher identifying fields holding content used for link extraction.
String toString()
-
Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
-
Method Detail
-
extractLinks
public final void extractLinks(Set<Link> links, CrawlDoc doc) throws IOException
- Specified by:
extractLinks in class AbstractLinkExtractor
- Throws:
IOException
-
extractTextLinks
public abstract void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
- Throws:
IOException
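As an illustration of what an extractTextLinks implementation might do with the supplied Reader, here is a minimal, self-contained sketch. It is not the library's code: it collects plain String URLs instead of Link objects, omits the HandlerDoc argument, and uses a hypothetical regular expression (URL) that only recognizes absolute http and https URLs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextLinkSketch {

    // Hypothetical pattern (an assumption, not from the library): an absolute
    // http(s) URL running up to whitespace, a quote, or an angle bracket.
    private static final Pattern URL =
            Pattern.compile("https?://[^\\s\"'<>]+");

    // Stand-in for extractTextLinks: Set<String> replaces Set<Link>, and the
    // HandlerDoc parameter is left out to keep the sketch self-contained.
    static void extractTextLinks(Set<String> links, Reader reader)
            throws IOException {
        BufferedReader buffered = new BufferedReader(reader);
        String line;
        while ((line = buffered.readLine()) != null) {
            Matcher matcher = URL.matcher(line);
            while (matcher.find()) {
                links.add(matcher.group());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Set<String> links = new LinkedHashSet<>();
        extractTextLinks(links, new StringReader(
                "See https://example.com/a and\nhttp://example.org/b today."));
        System.out.println(links);
    }
}
```

Reading line by line keeps memory use flat for large documents, at the cost of missing URLs that are split across lines.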
-
getFieldMatcher
public TextMatcher getFieldMatcher()
Gets field matcher identifying fields holding content used for link extraction. Default is null, using the document content stream instead.
- Returns:
- field matcher
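The effect of the field matcher on where text comes from can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: a Predicate<String> stands in for TextMatcher, a Map stands in for document metadata, and selectTexts is a hypothetical helper. What the real class does when a matcher is set but matches no field is not stated here; the sketch simply returns nothing in that case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class FieldMatcherSketch {

    // Hypothetical helper: returns the texts to scan for links.
    // A null matcher means "use the document content stream" (the default).
    static List<String> selectTexts(
            Predicate<String> fieldMatcher,
            Map<String, List<String>> metadata,
            String content) {
        List<String> texts = new ArrayList<>();
        if (fieldMatcher == null) {
            texts.add(content);
            return texts;
        }
        // Matcher set: collect the values of every matching metadata field.
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            if (fieldMatcher.test(entry.getKey())) {
                texts.addAll(entry.getValue());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("description", Arrays.asList("Go to http://example.com/"));
        metadata.put("title", Arrays.asList("Home"));

        System.out.println(selectTexts(null, metadata, "body text"));
        System.out.println(selectTexts("description"::equals, metadata, "body text"));
    }
}
```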
-
setFieldMatcher
public void setFieldMatcher(TextMatcher fieldMatcher)
Sets field matcher identifying fields holding content used for link extraction. Default is null, using the document content stream instead.
- Parameters:
fieldMatcher
- field matcher
-
loadLinkExtractorFromXML
public final void loadLinkExtractorFromXML(XML xml)
Description copied from class: AbstractLinkExtractor
Loads configuration settings specific to the implementing class.
- Specified by:
loadLinkExtractorFromXML in class AbstractLinkExtractor
- Parameters:
xml
- XML configuration
-
loadTextLinkExtractorFromXML
protected abstract void loadTextLinkExtractorFromXML(XML xml)
Loads configuration settings specific to the implementing class.
- Parameters:
xml
- XML configuration
-
saveLinkExtractorToXML
protected final void saveLinkExtractorToXML(XML xml)
Description copied from class: AbstractLinkExtractor
Saves configuration settings specific to the implementing class.
- Specified by:
saveLinkExtractorToXML in class AbstractLinkExtractor
- Parameters:
xml
- the XML
-
saveTextLinkExtractorToXML
protected abstract void saveTextLinkExtractorToXML(XML xml)
Saves configuration settings specific to the implementing class.
- Parameters:
xml
- the XML
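The pairing of final load/save methods with abstract text-specific counterparts suggests a template-method arrangement. The sketch below shows that pattern under loud assumptions: a Map<String, String> stands in for the XML type, the fieldMatcher is a plain String, and the class names (Base, PatternExtractor) and the "pattern" setting are hypothetical. Whether the real class delegates in exactly this order is not confirmed by this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ConfigSketch {

    // Stand-in for the abstract text extractor: the final methods handle the
    // shared setting, then hand over to the subclass hooks.
    abstract static class Base {
        private String fieldMatcher; // stand-in for TextMatcher

        final void load(Map<String, String> xml) {
            fieldMatcher = xml.get("fieldMatcher");
            loadText(xml);
        }

        final void save(Map<String, String> xml) {
            if (fieldMatcher != null) {
                xml.put("fieldMatcher", fieldMatcher);
            }
            saveText(xml);
        }

        String getFieldMatcher() {
            return fieldMatcher;
        }

        protected abstract void loadText(Map<String, String> xml);

        protected abstract void saveText(Map<String, String> xml);
    }

    // Hypothetical subclass with one setting of its own.
    static final class PatternExtractor extends Base {
        private String pattern;

        @Override
        protected void loadText(Map<String, String> xml) {
            pattern = xml.get("pattern");
        }

        @Override
        protected void saveText(Map<String, String> xml) {
            if (pattern != null) {
                xml.put("pattern", pattern);
            }
        }

        String getPattern() {
            return pattern;
        }
    }

    public static void main(String[] args) {
        Map<String, String> in = new LinkedHashMap<>();
        in.put("fieldMatcher", "description");
        in.put("pattern", "https?://\\S+");

        PatternExtractor extractor = new PatternExtractor();
        extractor.load(in);

        Map<String, String> out = new LinkedHashMap<>();
        extractor.save(out);
        System.out.println(out.equals(in)); // round trip preserves both settings
    }
}
```

Making the outer methods final keeps the shared setting from being skipped by accident: a subclass can only add to the configuration, not replace how the common part is read or written.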
-
equals
public boolean equals(Object other)
- Overrides:
equals in class AbstractLinkExtractor
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractLinkExtractor
-
toString
public String toString()
- Overrides:
toString in class AbstractLinkExtractor
-
-