Class XMLFeedLinkExtractor
- java.lang.Object
  - com.norconex.collector.http.link.AbstractLinkExtractor
    - com.norconex.collector.http.link.AbstractTextLinkExtractor
      - com.norconex.collector.http.link.impl.XMLFeedLinkExtractor
- All Implemented Interfaces: ILinkExtractor, IXMLConfigurable
public class XMLFeedLinkExtractor extends AbstractTextLinkExtractor
Link extractor for RSS and Atom XML feeds. It extracts the content of <link> tags. If you need more complex extraction, consider using RegexLinkExtractor or creating your own ILinkExtractor implementation.
Applicable documents
By default, this extractor is applied only to documents matching one of these content types:
Referrer data
The following referrer information is stored as metadata in each document represented by the extracted URLs:
- Referrer reference: The reference (URL) of the page where the link to a document was found. Metadata value is HttpDocMetadata.REFERRER_REFERENCE.
XML configuration usage:
<extractor class="com.norconex.collector.http.link.impl.XMLFeedLinkExtractor">
  <fieldMatcher>
    (optional expression for fields used for links extraction instead of the document stream)
  </fieldMatcher>
</extractor>
XML usage example:
<extractor class="com.norconex.collector.http.link.impl.XMLFeedLinkExtractor">
  <restrictTo field="document.reference" method="regex">
    .*rss$
  </restrictTo>
</extractor>
The above example specifies that this extractor applies only to documents whose URL ends with "rss" (in addition to the default supported content types).
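To make the effect of the restriction pattern concrete, here is a minimal, self-contained sketch that tests document references against the same regular expression using only java.util.regex. The class and method names (RestrictToSketch, acceptsReference) are hypothetical and not part of the Norconex API, and it assumes the pattern is matched against the whole reference; the collector's own matching and its content type check are not reproduced here.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not how the collector evaluates <restrictTo>.
class RestrictToSketch {

    // Same expression as in the XML usage example above.
    private static final Pattern RSS_SUFFIX = Pattern.compile(".*rss$");

    // Returns true when the whole reference matches the pattern,
    // i.e. when the URL ends with "rss" (assumed full-match semantics).
    static boolean acceptsReference(String reference) {
        return RSS_SUFFIX.matcher(reference).matches();
    }

    public static void main(String[] args) {
        String[] references = {
            "https://example.com/news.rss",
            "https://example.com/feed/rss",
            "https://example.com/rss/index.html"
        };
        for (String reference : references) {
            System.out.println(reference + " -> " + acceptsReference(reference));
        }
    }
}
```

Only references that end in "rss" pass, so a URL that merely contains "rss" in the middle of its path is rejected.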
- Since: 2.7.0
- Author: Pascal Essiembre
Constructor Summary
- XMLFeedLinkExtractor()
Method Summary
- boolean equals(Object other)
- void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader)
- int hashCode()
- protected void loadTextLinkExtractorFromXML(XML xml): Loads configuration settings specific to the implementing class.
- protected void saveTextLinkExtractorToXML(XML xml): Saves configuration settings specific to the implementing class.
- String toString()
Methods inherited from class com.norconex.collector.http.link.AbstractTextLinkExtractor
extractLinks, getFieldMatcher, loadLinkExtractorFromXML, saveLinkExtractorToXML, setFieldMatcher
-
Methods inherited from class com.norconex.collector.http.link.AbstractLinkExtractor
addRestriction, addRestrictions, clearRestrictions, extractLinks, getRestrictions, loadFromXML, removeRestriction, removeRestriction, saveToXML, setRestrictions
Method Detail
-
extractTextLinks
public void extractTextLinks(Set<Link> links, HandlerDoc doc, Reader reader) throws IOException
- Specified by: extractTextLinks in class AbstractTextLinkExtractor
- Throws: IOException
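As a rough illustration of the kind of extraction described in the class overview (pulling URLs out of <link> tags in RSS and Atom feeds), the following sketch uses the JDK's StAX parser. It is not the Norconex implementation: FeedLinkSketch and extractFeedLinks are hypothetical names, and reading the href attribute for Atom-style links is an assumption made for this example.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch; not part of the Norconex API.
class FeedLinkSketch {

    // Collects URLs from every <link> element, in document order.
    static Set<String> extractFeedLinks(Reader reader) {
        Set<String> urls = new LinkedHashSet<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Feeds come from untrusted sources: turn off DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader xml = factory.createXMLStreamReader(reader);
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && "link".equals(xml.getLocalName())) {
                    // Assumption: Atom-style links carry the URL in href,
                    // RSS-style links carry it as element text.
                    String href = xml.getAttributeValue(null, "href");
                    String url = href != null ? href : xml.getElementText();
                    if (url != null && !url.trim().isEmpty()) {
                        urls.add(url.trim());
                    }
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed feed XML", e);
        }
        return urls;
    }

    public static void main(String[] args) {
        String rss = "<rss><channel><link>https://example.com/a</link></channel></rss>";
        System.out.println(extractFeedLinks(new StringReader(rss)));
    }
}
```

A streaming parser is used here so the whole feed never has to be held in memory, which fits the Reader-based signature of extractTextLinks.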
-
loadTextLinkExtractorFromXML
protected void loadTextLinkExtractorFromXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Loads configuration settings specific to the implementing class.
- Specified by: loadTextLinkExtractorFromXML in class AbstractTextLinkExtractor
- Parameters: xml - XML configuration
-
saveTextLinkExtractorToXML
protected void saveTextLinkExtractorToXML(XML xml)
Description copied from class: AbstractTextLinkExtractor
Saves configuration settings specific to the implementing class.
- Specified by: saveTextLinkExtractorToXML in class AbstractTextLinkExtractor
- Parameters: xml - the XML
-
equals
public boolean equals(Object other)
- Overrides: equals in class AbstractTextLinkExtractor
-
hashCode
public int hashCode()
- Overrides: hashCode in class AbstractTextLinkExtractor
-
toString
public String toString()
- Overrides: toString in class AbstractTextLinkExtractor