Class XMLFeedLinkExtractor

  • All Implemented Interfaces:
    ILinkExtractor, IXMLConfigurable

    public class XMLFeedLinkExtractor
    extends AbstractTextLinkExtractor

    Extracts links from RSS and Atom XML feeds by reading the content of <link> tags. If you need more complex extraction, consider using RegexLinkExtractor or creating your own ILinkExtractor implementation.
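
    To illustrate what "the content of <link> tags" means for a feed, here is a minimal standalone sketch using only the JDK's DOM parser. It is not the XMLFeedLinkExtractor implementation; the class name LinkTagSketch, the method extractLinks, and the sample feed are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not part of the Norconex API.
public class LinkTagSketch {

    // Returns the trimmed text content of every <link> element, in document order.
    static List<String> extractLinks(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("link");
            List<String> links = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent().trim();
                // Skip <link> elements that carry no text content.
                if (!text.isEmpty()) {
                    links.add(text);
                }
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample feed for demonstration.
        String feed = "<rss><channel>"
                + "<link>https://example.com/</link>"
                + "<item><link>https://example.com/a</link></item>"
                + "</channel></rss>";
        extractLinks(feed).forEach(System.out::println);
    }
}
```

    Because the sketch reads only element text, a <link> element with no text content yields nothing here. That kind of case is where RegexLinkExtractor or a custom ILinkExtractor may be the better fit.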

    Applicable documents

    By default, this extractor is applied only to documents matching one of these content types:

    Referrer data

    The following referrer information is stored as metadata in each document represented by the extracted URLs:

    XML configuration usage:

    
    <extractor
        class="com.norconex.collector.http.link.impl.XMLFeedLinkExtractor">
      <fieldMatcher>
        (optional expression for fields used for links extraction instead
         of the document stream)
      </fieldMatcher>
    </extractor>

    XML usage example:

    
    <extractor
        class="com.norconex.collector.http.link.impl.XMLFeedLinkExtractor">
      <restrictTo
          field="document.reference"
          method="regex">
        .*rss$
      </restrictTo>
    </extractor>

    The above example specifies that this extractor should apply only to documents whose URL ends with "rss" (in addition to the default content types supported).
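
    As a quick check of which references the expression .*rss$ would accept, the following sketch evaluates it with plain java.util.regex. It assumes a case-sensitive match against the whole reference; how the collector itself evaluates the restrictTo expression (flags, case handling) is not shown here. The class name RestrictToRegexSketch and the sample URLs are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex API.
public class RestrictToRegexSketch {

    // Same expression as in the XML usage example above.
    private static final Pattern RSS_SUFFIX = Pattern.compile(".*rss$");

    // True when the whole reference matches the expression (assumed semantics).
    static boolean applies(String reference) {
        return RSS_SUFFIX.matcher(reference).matches();
    }

    public static void main(String[] args) {
        // Ends with "rss": accepted.
        System.out.println(applies("https://example.com/feed.rss"));
        // Contains "rss" but does not end with it: rejected.
        System.out.println(applies("https://example.com/rss/index.html"));
    }
}
```

    Note that the expression has no dot before "rss", so a reference such as https://example.com/blog/rss is accepted as well.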

    Since:
    2.7.0
    Author:
    Pascal Essiembre