Class XMLFeedLinkExtractor

All Implemented Interfaces:
ILinkExtractor, IXMLConfigurable

public class XMLFeedLinkExtractor extends AbstractTextLinkExtractor

Extracts links from RSS and Atom XML feeds by reading the content of <link> tags. If you need more complex extraction, consider using RegexLinkExtractor or creating your own ILinkExtractor implementation.
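To illustrate what "the content of <link> tags" means in practice, here is a minimal standalone sketch that uses only the JDK's DOM parser. It is not this class's implementation: the class name FeedLinkSketch and the method extractLinks are hypothetical, and the sketch reads element text only, without looking at attributes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of the Norconex API.
final class FeedLinkSketch {

    // Returns the trimmed text content of every <link> element, in document order.
    static List<String> extractLinks(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so untrusted feeds cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(feedXml)));

            NodeList nodes = doc.getElementsByTagName("link");
            List<String> links = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent().trim();
                if (!text.isEmpty()) {
                    links.add(text); // skip <link> elements that carry no text
                }
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed XML", e);
        }
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel><title>Example</title>"
                + "<link>https://example.com/</link>"
                + "<item><title>A</title><link>https://example.com/a</link></item>"
                + "</channel></rss>";
        System.out.println(extractLinks(rss));
    }
}
```

For the sample feed in main, the sketch collects the channel link followed by the item link. The URLs are placeholders chosen for the example.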

Applicable documents

By default, this extractor is applied only to documents matching one of these content types:

Referrer data

The following referrer information is stored as metadata in each document represented by the extracted URLs:

XML configuration usage:


<extractor
    class="com.norconex.collector.http.link.impl.XMLFeedLinkExtractor">
  <fieldMatcher>
    (optional expression for fields used for links extraction instead
     of the document stream)
  </fieldMatcher>
</extractor>

XML usage example:


<extractor
    class="com.norconex.collector.http.link.impl.XMLFeedLinkExtractor">
  <restrictTo
      field="document.reference"
      method="regex">
    .*rss$
  </restrictTo>
</extractor>

The above example specifies that this extractor applies only to documents whose URL ends with "rss" (in addition to the default content types supported).
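As a rough check of which URLs the expression .*rss$ selects, the sketch below applies it with plain java.util.regex. The class name RestrictToSketch and the method endsWithRss are hypothetical, and the collector's own matching options (case sensitivity, for instance) are not modeled here and may differ.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex API.
final class RestrictToSketch {

    // Same expression as in the XML example above.
    private static final Pattern ENDS_WITH_RSS = Pattern.compile(".*rss$");

    // True when the whole reference (URL) matches the expression.
    static boolean endsWithRss(String reference) {
        return ENDS_WITH_RSS.matcher(reference).matches();
    }

    public static void main(String[] args) {
        System.out.println(endsWithRss("https://example.com/feed.rss"));       // true
        System.out.println(endsWithRss("https://example.com/rss/index.html")); // false
    }
}
```

A URL that merely contains "rss" somewhere in its path is not selected; the expression requires it at the very end of the reference.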

Since:
2.7.0
Author:
Pascal Essiembre