Class XMLFeedLinkExtractor
java.lang.Object
com.norconex.collector.http.link.AbstractLinkExtractor
com.norconex.collector.http.link.AbstractTextLinkExtractor
com.norconex.collector.http.link.impl.XMLFeedLinkExtractor
- All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable
Link extractor for extracting links out of
RSS and
Atom XML feeds.
It extracts the content of <link> tags. If you need more complex
extraction, consider using RegexLinkExtractor or creating your own
ILinkExtractor implementation.
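As an illustration of what extracting <link> tags from a feed involves, here is a minimal stand-alone sketch. It is not the source of this class: the FeedLinkSketch name and its extractLinks method are hypothetical, and reading the href attribute of Atom <link> elements (in addition to the text content used by RSS) is an assumption about what a feed extractor typically needs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; not the XMLFeedLinkExtractor source.
public class FeedLinkSketch {

    /** Returns the URL carried by every <link> element, in document order. */
    public static List<String> extractLinks(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Feeds come from untrusted sites, so refuse DOCTYPE declarations.
            factory.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document dom = builder.parse(
                    new InputSource(new StringReader(feedXml)));

            List<String> urls = new ArrayList<>();
            NodeList nodes = dom.getElementsByTagName("link");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element link = (Element) nodes.item(i);
                // Assumption: Atom puts the URL in href, RSS in the tag content.
                String url = link.getAttribute("href");
                if (url.isEmpty()) {
                    url = link.getTextContent().trim();
                }
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
            return urls;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed XML", e);
        }
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel>"
                + "<link>https://example.com/</link>"
                + "<item><link>https://example.com/a</link></item>"
                + "</channel></rss>";
        String atom = "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
                + "<entry><link href=\"https://example.com/b\"/></entry>"
                + "</feed>";
        System.out.println(extractLinks(rss));  // [https://example.com/, https://example.com/a]
        System.out.println(extractLinks(atom)); // [https://example.com/b]
    }
}
```

A DOM parser keeps the sketch short. For large feeds a streaming parser would use less memory, and feeds that are not well-formed XML are a case where RegexLinkExtractor may be the more practical choice.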
Applicable documents
By default, this extractor is applied only to documents matching one of these content types:
Referrer data
The following referrer information is stored as metadata in each document represented by the extracted URLs:
- Referrer reference: The reference (URL) of the page where the
link to a document was found. Metadata value is
HttpDocMetadata.REFERRER_REFERENCE.
XML configuration usage:
<extractor
    class="com.norconex.collector.http.link.impl.XMLFeedLinkExtractor">
  <fieldMatcher>
    (optional expression for fields used for links extraction instead
    of the document stream)
  </fieldMatcher>
</extractor>
XML usage example:
<extractor
    class="com.norconex.collector.http.link.impl.XMLFeedLinkExtractor">
  <restrictTo
      field="document.reference"
      method="regex">
    .*rss$
  </restrictTo>
</extractor>
The above example specifies that this extractor should apply only to documents whose URL ends with "rss" (in addition to the default content types supported).
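To see which document references the expression in the example accepts, the same pattern can be tried with plain java.util.regex. This is only a sketch: the RestrictToSketch class and its accepts method are hypothetical, and it assumes the expression is matched against the whole reference, case-sensitively, with surrounding whitespace ignored. The library's own matching options may differ.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; it does not call the Norconex restriction API.
public class RestrictToSketch {

    // Same expression as in the <restrictTo> example above.
    private static final Pattern RSS_URL = Pattern.compile(".*rss$");

    /** True when the whole document reference matches the expression. */
    public static boolean accepts(String reference) {
        return RSS_URL.matcher(reference).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("https://example.com/news/rss"));       // true
        System.out.println(accepts("https://example.com/feed.rss"));       // true
        System.out.println(accepts("https://example.com/rss/index.html")); // false
        System.out.println(accepts("https://example.com/feed.RSS"));       // false (case-sensitive in this sketch)
    }
}
```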
- Since:
- 2.7.0
- Author:
- Pascal Essiembre
Constructor Summary
Constructors
XMLFeedLinkExtractor()
Method Summary
Modifier and Type | Method | Description
boolean | equals |
void | extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) |
int | hashCode() |
protected void | loadTextLinkExtractorFromXML | Loads configuration settings specific to the implementing class.
protected void | saveTextLinkExtractorToXML | Saves configuration settings specific to the implementing class.
String | toString() |
Methods inherited from class com.norconex.collector.http.link.AbstractTextLinkExtractor
extractLinks, getFieldMatcher, loadLinkExtractorFromXML, saveLinkExtractorToXML, setFieldMatcher
Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
Constructor Details
XMLFeedLinkExtractor
public XMLFeedLinkExtractor()
Method Details
extractTextLinks
- Specified by:
extractTextLinks in class AbstractTextLinkExtractor
- Throws:
IOException
loadTextLinkExtractorFromXML
Description copied from class: AbstractTextLinkExtractor
Loads configuration settings specific to the implementing class.
- Specified by:
loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
- Parameters:
xml - XML configuration
saveTextLinkExtractorToXML
Description copied from class: AbstractTextLinkExtractor
Saves configuration settings specific to the implementing class.
- Specified by:
saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
- Parameters:
xml - the XML
equals
- Overrides:
equals in class AbstractTextLinkExtractor
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractTextLinkExtractor
toString
- Overrides:
toString in class AbstractTextLinkExtractor