public class XMLFeedLinkExtractor extends Object implements ILinkExtractor, IXMLConfigurable
Link extractor for extracting links out of RSS and Atom XML feeds.
It extracts the content of <link> tags. If you need more complex extraction, consider using RegexLinkExtractor or creating your own ILinkExtractor implementation.
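To make the idea of "extracting the content of <link> tags" concrete, here is a minimal sketch that uses only the JDK's DOM parser. It is not this class's actual implementation: the class name FeedLinkSketch and its method are hypothetical, and falling back from an Atom-style href attribute to the element text is an assumption made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not part of the Norconex API.
class FeedLinkSketch {

    /** Collects the URL carried by every <link> element of an XML feed. */
    static Set<String> extractLinks(InputStream input) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations to avoid external entity attacks.
            factory.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(input);

            // The factory is not namespace-aware by default, so this matches
            // unprefixed <link> elements in both RSS and Atom documents.
            NodeList nodes = doc.getElementsByTagName("link");
            Set<String> links = new LinkedHashSet<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element link = (Element) nodes.item(i);
                // Assumption: Atom puts the URL in "href", RSS in the text.
                String url = link.getAttribute("href");
                if (url.isEmpty()) {
                    url = link.getTextContent().trim();
                }
                if (!url.isEmpty()) {
                    links.add(url);
                }
            }
            return links;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel>"
                + "<link>https://example.com/</link>"
                + "<item><link>https://example.com/a</link></item>"
                + "</channel></rss>";
        InputStream in =
                new ByteArrayInputStream(rss.getBytes(StandardCharsets.UTF_8));
        System.out.println(extractLinks(in));
    }
}
```

A DOM parse keeps the sketch short; a streaming parser would likely be a better fit for very large feeds.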
By default, this extractor extracts URLs only from documents whose content type is one of the following:
application/rss+xml
application/rdf+xml
application/atom+xml
application/xml
text/xml
You can specify your own content type or reference restriction patterns using setApplyToContentTypePattern(String) or setApplyToReferencePattern(String), but make sure they represent text files. When both methods are used, a document must be matched by both to be accepted. Because "text/xml" and "application/xml" are quite generic (not specific to RSS/Atom feeds), you may want to be more restrictive if that causes issues.
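The "matched by both" rule can be sketched with plain java.util.regex. This is an illustration under stated assumptions, not the class's own code: FeedAcceptSketch is a hypothetical name, ASSUMED_DEFAULT_CONTENT_TYPES is a stand-in written from the list of default content types above (the real value of DEFAULT_CONTENT_TYPE_PATTERN is not shown on this page), and the use of full-string matching is also an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex API.
class FeedAcceptSketch {

    // Assumed stand-in for DEFAULT_CONTENT_TYPE_PATTERN (actual value unknown).
    static final String ASSUMED_DEFAULT_CONTENT_TYPES =
            "application/(rss|rdf|atom)\\+xml|application/xml|text/xml";

    /**
     * Accepts a document only if every configured pattern matches.
     * A null pattern means "no restriction" for that criterion.
     */
    static boolean accepts(String url, String contentType,
            String contentTypePattern, String referencePattern) {
        boolean typeOk = contentTypePattern == null
                || Pattern.matches(contentTypePattern, contentType);
        boolean referenceOk = referencePattern == null
                || Pattern.matches(referencePattern, url);
        return typeOk && referenceOk;
    }

    public static void main(String[] args) {
        // Feed-like content type and a URL ending in "rss": both patterns match.
        System.out.println(accepts("https://example.com/feed.rss",
                "application/rss+xml", ASSUMED_DEFAULT_CONTENT_TYPES, ".*rss$"));
        // Generic XML content type, but the URL does not end in "rss".
        System.out.println(accepts("https://example.com/sitemap.xml",
                "text/xml", ASSUMED_DEFAULT_CONTENT_TYPES, ".*rss$"));
    }
}
```

Passing a narrower content type pattern, for example one without "text/xml" and "application/xml", is one way to be more restrictive when generic XML documents are picked up by mistake.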
The following referrer information is stored as metadata in each document represented by the extracted URLs: HttpMetadata.COLLECTOR_REFERRER_REFERENCE.

    <extractor class="com.norconex.collector.http.url.impl.XMLFeedLinkExtractor">
      <applyToContentTypePattern>
        (Regular expression matching content types this extractor
         should apply to. See documentation for default.)
      </applyToContentTypePattern>
      <applyToReferencePattern>
        (Regular expression matching references this extractor should
         apply to. Default accepts all references.)
      </applyToReferencePattern>
    </extractor>

The following specifies that this extractor should apply only to documents whose URL ends with "rss" (in addition to matching the default supported content types).

    <extractor class="com.norconex.collector.http.url.impl.XMLFeedLinkExtractor">
      <applyToReferencePattern>.*rss$</applyToReferencePattern>
    </extractor>

Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_CONTENT_TYPE_PATTERN |

Constructor and Description |
---|
XMLFeedLinkExtractor() |

Modifier and Type | Method and Description |
---|---|
boolean | accepts(String url, ContentType contentType) Whether this link extraction should be executed for the given URL and/or content type. |
boolean | equals(Object other) |
Set<Link> | extractLinks(InputStream input, String reference, ContentType contentType) Extracts links from a document. |
String | getApplyToContentTypePattern() |
String | getApplyToReferencePattern() |
int | hashCode() |
void | loadFromXML(Reader in) |
void | saveToXML(Writer out) |
void | setApplyToContentTypePattern(String applyToContentTypePattern) |
void | setApplyToReferencePattern(String applyToReferencePattern) |
String | toString() |

public static final String DEFAULT_CONTENT_TYPE_PATTERN

public Set<Link> extractLinks(InputStream input, String reference, ContentType contentType) throws IOException
Specified by: extractLinks in interface ILinkExtractor
Parameters:
input - the document input stream
reference - document reference (URL)
contentType - the document content type
Throws:
IOException - problem reading the document

public boolean accepts(String url, ContentType contentType)
Specified by: accepts in interface ILinkExtractor
Parameters:
url - the url
contentType - the content type
Returns:
true if the given URL is accepted

public String getApplyToContentTypePattern()

public void setApplyToContentTypePattern(String applyToContentTypePattern)

public String getApplyToReferencePattern()

public void setApplyToReferencePattern(String applyToReferencePattern)

public void loadFromXML(Reader in)
Specified by: loadFromXML in interface IXMLConfigurable

public void saveToXML(Writer out) throws IOException
Specified by: saveToXML in interface IXMLConfigurable
Throws:
IOException

Copyright © 2009–2021 Norconex Inc. All rights reserved.