public abstract class AbstractTextLinkExtractor extends AbstractLinkExtractor
Base class for extracting links from text documents. It provides common configuration settings: restricting extraction to specific documents, and specifying one or more metadata fields from which to read the text used for link extraction.
Not suitable for binary files.
Subclasses inherit the following XML configuration options:
<!-- multiple "restrictTo" tags allowed (only one needs to match) -->
<restrictTo>
<fieldMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]">
(field-matching expression)
</fieldMatcher>
<valueMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]">
(value-matching expression)
</valueMatcher>
</restrictTo>
<fieldMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]">
(optional expression matching fields used for link extraction instead
of the document stream)
</fieldMatcher>
Example:
<restrictTo>
<fieldMatcher>document.contentType</fieldMatcher>
<valueMatcher
method="wildcard">
text/*
</valueMatcher>
</restrictTo>
The above will apply to any content type starting with "text/".
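To make the wildcard behavior concrete, here is a minimal standalone sketch. It is not the Norconex `TextMatcher` implementation: it assumes `*` matches any run of characters and that the whole value must match (the `partial="false"` case). The class `WildcardSketch` and its methods are hypothetical names introduced only for illustration.

```java
import java.util.regex.Pattern;

public class WildcardSketch {

    // Hypothetical helper: turns a wildcard expression into a regex,
    // treating '*' as "any run of characters" and all other text literally.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        String[] parts = wildcard.split("\\*", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Full (non-partial) match of a value against a wildcard expression.
    static boolean matches(String wildcard, String value) {
        return toPattern(wildcard).matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("text/*", "text/html"));       // true
        System.out.println(matches("text/*", "text/plain"));      // true
        System.out.println(matches("text/*", "application/pdf")); // false
    }
}
```

Under these assumptions, a document whose `document.contentType` is `text/html` passes the restriction, while `application/pdf` does not.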
Constructor and Description |
---|
AbstractTextLinkExtractor() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
void | extractLinks(Set<Link> links, CrawlDoc doc) |
abstract void | extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) |
TextMatcher | getFieldMatcher(): Gets field matcher identifying fields holding content used for link extraction. |
int | hashCode() |
void | loadLinkExtractorFromXML(XML xml): Loads configuration settings specific to the implementing class. |
protected abstract void | loadTextLinkExtractorFromXML(XML xml): Loads configuration settings specific to the implementing class. |
protected void | saveLinkExtractorToXML(XML xml): Saves configuration settings specific to the implementing class. |
protected abstract void | saveTextLinkExtractorToXML(XML xml): Saves configuration settings specific to the implementing class. |
void | setFieldMatcher(TextMatcher fieldMatcher): Sets field matcher identifying fields holding content used for link extraction. |
String | toString() |
Methods inherited from class AbstractLinkExtractor: addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
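public final void extractLinks(Set<Link> links, CrawlDoc doc) throws IOException
Specified by: extractLinks in class AbstractLinkExtractor
Throws: IOException

public abstract void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
Throws: IOException

public TextMatcher getFieldMatcher()
Gets field matcher identifying fields holding content used for link extraction. Default is null, using the document content stream instead.

public void setFieldMatcher(TextMatcher fieldMatcher)
Sets field matcher identifying fields holding content used for link extraction. Default is null, using the document content stream instead.
Parameters: fieldMatcher - field matcher

public final void loadLinkExtractorFromXML(XML xml)
Description copied from class: AbstractLinkExtractor
Loads configuration settings specific to the implementing class.
Specified by: loadLinkExtractorFromXML in class AbstractLinkExtractor
Parameters: xml - XML configuration

protected abstract void loadTextLinkExtractorFromXML(XML xml)
Loads configuration settings specific to the implementing class.
Parameters: xml - XML configuration

protected final void saveLinkExtractorToXML(XML xml)
Description copied from class: AbstractLinkExtractor
Saves configuration settings specific to the implementing class.
Specified by: saveLinkExtractorToXML in class AbstractLinkExtractor
Parameters: xml - the XML

protected abstract void saveTextLinkExtractorToXML(XML xml)
Saves configuration settings specific to the implementing class.
Parameters: xml - the XML

public boolean equals(Object other)
Overrides: equals in class AbstractLinkExtractor

public int hashCode()
Overrides: hashCode in class AbstractLinkExtractor

public String toString()
Overrides: toString in class AbstractLinkExtractor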
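The method signatures above suggest a template-method design: the final `extractLinks` prepares a `Reader` over the text and hands it to the abstract `extractTextLinks` hook. The following self-contained sketch illustrates that pattern only. It does not use the Norconex classes: `TextExtractorBase` and `UrlPatternExtractor` are hypothetical stand-ins, links are plain strings rather than `Link` objects, the document is a plain `String` rather than a `CrawlDoc` or `HandlerDoc`, and the delegation shown is an assumption inferred from the signatures.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextLinkExtractorSketch {

    // Stand-in for the abstract base class (not the Norconex type).
    abstract static class TextExtractorBase {

        // Mirrors the final extractLinks(...): opens a Reader, then delegates.
        public final void extractLinks(Set<String> links, String text)
                throws IOException {
            try (Reader reader = new StringReader(text)) {
                extractTextLinks(links, reader);
            }
        }

        // Mirrors the abstract extractTextLinks(...) hook subclasses implement.
        public abstract void extractTextLinks(Set<String> links, Reader reader)
                throws IOException;
    }

    // Hypothetical subclass: collects absolute http(s) URLs, line by line.
    static class UrlPatternExtractor extends TextExtractorBase {
        private static final Pattern URL =
                Pattern.compile("https?://[^\\s\"'<>]+");

        @Override
        public void extractTextLinks(Set<String> links, Reader reader)
                throws IOException {
            BufferedReader lines = new BufferedReader(reader);
            String line;
            while ((line = lines.readLine()) != null) {
                Matcher m = URL.matcher(line);
                while (m.find()) {
                    links.add(m.group());
                }
            }
        }
    }

    // Convenience wrapper returning the links in encounter order.
    static Set<String> extract(String text) {
        Set<String> links = new LinkedHashSet<>();
        try {
            new UrlPatternExtractor().extractLinks(links, text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(
                extract("See https://example.com/a and\nhttp://example.org/b today"));
    }
}
```

In a real subclass, the same hook would receive the reader supplied by the base class, and the configuration hooks `loadTextLinkExtractorFromXML` and `saveTextLinkExtractorToXML` would also need to be implemented.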
Copyright © 2009–2023 Norconex Inc. All rights reserved.