XMLFeedLinkExtractor (Norconex HTTP Collector 2.9.1 API)

java.lang.Object
- com.norconex.collector.http.url.impl.XMLFeedLinkExtractor

All Implemented Interfaces:

ILinkExtractor, IXMLConfigurable
```
public class XMLFeedLinkExtractor
extends Object
implements ILinkExtractor, IXMLConfigurable
```
Link extractor for extracting links out of RSS and Atom XML feeds. It extracts the content of <link> tags. If you need more complex extraction, consider using RegexLinkExtractor or creating your own ILinkExtractor implementation.
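To make the extraction concrete, here is a minimal stand-alone sketch of pulling URLs out of `<link>` elements with the JDK's DOM parser. It is not the Norconex implementation: the class name `FeedLinkSketch` is hypothetical, and reading the `href` attribute for Atom-style links is an assumption about how "content" could be interpreted.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical stand-alone sketch; not the Norconex implementation.
final class FeedLinkSketch {

    // Collects a URL from every <link> element: the href attribute when
    // present (Atom style, an assumption), otherwise the element text
    // (RSS style).
    static Set<String> extractLinks(InputStream input) {
        Document doc;
        try {
            DocumentBuilderFactory factory =
                    DocumentBuilderFactory.newInstance();
            // Feeds come from untrusted servers, so refuse DOCTYPEs.
            factory.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl",
                    true);
            doc = factory.newDocumentBuilder().parse(input);
        } catch (ParserConfigurationException | SAXException
                | IOException e) {
            throw new IllegalArgumentException("Unparseable XML feed", e);
        }

        // LinkedHashSet drops duplicates and keeps document order.
        Set<String> urls = new LinkedHashSet<>();
        NodeList links = doc.getElementsByTagName("link");
        for (int i = 0; i < links.getLength(); i++) {
            Element link = (Element) links.item(i);
            String url = link.hasAttribute("href")
                    ? link.getAttribute("href")
                    : link.getTextContent();
            url = url.trim();
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String rss = "<rss><channel>"
                + "<link>http://example.com/</link>"
                + "<item><link>http://example.com/a.html</link></item>"
                + "</channel></rss>";
        InputStream in = new ByteArrayInputStream(
                rss.getBytes(StandardCharsets.UTF_8));
        System.out.println(extractLinks(in));
    }
}
```

The sketch returns plain strings for brevity, whereas the real `extractLinks` method returns a `Set<Link>` and also takes the document reference and content type.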

Applicable documents

By default, this extractor extracts URLs only from documents whose content type is one of the following:
```
application/rss+xml
application/rdf+xml
application/atom+xml
application/xml
text/xml
```
You can specify your own content type or reference restriction patterns using setApplyToContentTypePattern(String) or setApplyToReferencePattern(String), but make sure they represent text files. When both methods are used, a document must be matched by both to be accepted. Because "text/xml" and "application/xml" are quite generic (not specific to RSS/Atom feeds), consider using a more restrictive pattern if they cause unrelated XML documents to be processed.
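The "matched by both" rule can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `AcceptRuleSketch` and `DEFAULT_TYPES` are hypothetical names, the regular expression merely mirrors the default content types listed above (the actual value of DEFAULT_CONTENT_TYPE_PATTERN is not shown in this page), and full-string matching is assumed.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the "both must match" rule; not library code.
final class AcceptRuleSketch {

    // Assumed pattern mirroring the default content types listed above.
    static final String DEFAULT_TYPES =
            "application/(rss\\+xml|rdf\\+xml|atom\\+xml|xml)|text/xml";

    private final Pattern contentTypePattern;
    // null stands for "accept all references", the documented default.
    private final Pattern referencePattern;

    AcceptRuleSketch(String contentTypeRegex, String referenceRegex) {
        this.contentTypePattern = Pattern.compile(contentTypeRegex);
        this.referencePattern = referenceRegex == null
                ? null
                : Pattern.compile(referenceRegex);
    }

    // A document is accepted only when its content type matches AND,
    // if a reference pattern is set, its URL matches as well.
    boolean accepts(String url, String contentType) {
        boolean typeOk = contentTypePattern.matcher(contentType).matches();
        boolean refOk = referencePattern == null
                || referencePattern.matcher(url).matches();
        return typeOk && refOk;
    }

    public static void main(String[] args) {
        AcceptRuleSketch rule =
                new AcceptRuleSketch(DEFAULT_TYPES, ".*rss$");
        System.out.println(rule.accepts(
                "http://example.com/news.rss", "application/rss+xml"));
        System.out.println(rule.accepts(
                "http://example.com/news.rss", "text/html"));
    }
}
```

In this sketch, narrowing `DEFAULT_TYPES` (for example, dropping the generic `application/xml` and `text/xml` alternatives) is one way to be more restrictive.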

Referrer data

The following referrer information is stored as metadata in each document represented by the extracted URLs:
- Referrer reference: The reference (URL) of the page where the link to a document was found. Metadata value is HttpMetadata.COLLECTOR_REFERRER_REFERENCE.

XML configuration usage:
```
  <extractor class="com.norconex.collector.http.url.impl.XMLFeedLinkExtractor">
      <applyToContentTypePattern>
          (Regular expression matching content types this extractor 
           should apply to. See documentation for default.)
      </applyToContentTypePattern>
      <applyToReferencePattern>
          (Regular expression matching references this extractor should
           apply to. Default accepts all references.)
      </applyToReferencePattern>
  </extractor>
```
Usage example:

The following configures this extractor to apply only to documents whose URL ends with "rss" (and whose content type is also one of the supported defaults).
```
  <extractor class="com.norconex.collector.http.url.impl.XMLFeedLinkExtractor">
      <applyToReferencePattern>.*rss$</applyToReferencePattern>
  </extractor>
```
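To see which URLs the `.*rss$` pattern from the example would let through, the following sketch tests it against a few sample URLs with `java.util.regex`. The class name `ReferencePatternDemo` and the sample URLs are made up for illustration, and how the extractor applies the pattern internally is an assumption here; because the pattern starts with `.*` and ends with `$`, full-string matching and substring search give the same result for these URLs.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the reference pattern from the usage example.
final class ReferencePatternDemo {

    static final Pattern ENDS_WITH_RSS = Pattern.compile(".*rss$");

    // True when the whole URL matches the pattern, i.e. ends in "rss".
    static boolean matches(String url) {
        return ENDS_WITH_RSS.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/feed.rss",
            "http://example.com/rss",
            "http://example.com/rss.xml",
            "http://example.com/feed.rss?page=2"
        };
        for (String url : urls) {
            System.out.println(matches(url) + "  " + url);
        }
    }
}
```

Note that a query string after "rss" prevents a match, so a feed served at a URL such as `feed.rss?page=2` would need a looser pattern.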
Since:

2.7.0

Author:

Pascal Essiembre

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`DEFAULT_CONTENT_TYPE_PATTERN`

Constructor Summary

Constructors
Constructor and Description
`XMLFeedLinkExtractor()`

Method Summary

Modifier and Type	Method and Description
`boolean`	`accepts(String url, ContentType contentType)` Whether this link extraction should be executed for the given URL and/or content type.
`boolean`	`equals(Object other)`
`Set<Link>`	`extractLinks(InputStream input, String reference, ContentType contentType)` Extracts links from a document.
`String`	`getApplyToContentTypePattern()`
`String`	`getApplyToReferencePattern()`
`int`	`hashCode()`
`void`	`loadFromXML(Reader in)`
`void`	`saveToXML(Writer out)`
`void`	`setApplyToContentTypePattern(String applyToContentTypePattern)`
`void`	`setApplyToReferencePattern(String applyToReferencePattern)`
`String`	`toString()`

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Field Detail
  - DEFAULT_CONTENT_TYPE_PATTERN
```
public static final String DEFAULT_CONTENT_TYPE_PATTERN
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - XMLFeedLinkExtractor
```
public XMLFeedLinkExtractor()
```
- Method Detail
  - extractLinks
```
public Set<Link> extractLinks(InputStream input,
                              String reference,
                              ContentType contentType)
                       throws IOException
```
    Description copied from interface: ILinkExtractor
    
    Extracts links from a document.
    
    Specified by:
    
    extractLinks in interface ILinkExtractor
    
    Parameters:
    
    input - the document input stream
    
    reference - document reference (URL)
    
    contentType - the document content type
    
    Returns:
    
    a set of links
    
    Throws:
    
    IOException - problem reading the document
  - accepts
```
public boolean accepts(String url,
                       ContentType contentType)
```
    Description copied from interface: ILinkExtractor
    
    Whether this link extraction should be executed for the given URL and/or content type.
    
    Specified by:
    
    accepts in interface ILinkExtractor
    
    Parameters:
    
    url - the url
    
    contentType - the content type
    
    Returns:
    
    true if the given URL is accepted
  - getApplyToContentTypePattern
```
public String getApplyToContentTypePattern()
```
  - setApplyToContentTypePattern
```
public void setApplyToContentTypePattern(String applyToContentTypePattern)
```
  - getApplyToReferencePattern
```
public String getApplyToReferencePattern()
```
  - setApplyToReferencePattern
```
public void setApplyToReferencePattern(String applyToReferencePattern)
```
  - loadFromXML
```
public void loadFromXML(Reader in)
```
    Specified by:
    
    loadFromXML in interface IXMLConfigurable
  - saveToXML
```
public void saveToXML(Writer out)
               throws IOException
```
    Specified by:
    
    saveToXML in interface IXMLConfigurable
    
    Throws:
    
    IOException
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class Object
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class Object
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class Object
