public class TikaLinkExtractor extends Object implements ILinkExtractor, IXMLConfigurable
Implementation of ILinkExtractor using Apache Tika to perform URL extractions from HTML documents. This is an alternative to the GenericLinkExtractor.
The configuration of content types, storing the referrer data, and ignoring "nofollow" are the same as in GenericLinkExtractor.
Since 2.3.0, this extractor preserves the hash character (#) found in URLs and every character after it. It relies on the IURLNormalizer implementation to strip the fragment if need be. Also since 2.3.0, GenericURLNormalizer is always invoked by default, and its default set of rules removes fragments.
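The fragment behavior described above can be illustrated with a minimal sketch in plain Java. It does not use the Norconex API: `FragmentStripSketch` and `stripFragment` are hypothetical names, and the simple string-based stripping only approximates what a normalizer rule that removes fragments might do after the extractor has returned the URL with its fragment intact.

```java
// Hypothetical helper for illustration only; not part of the Norconex API.
final class FragmentStripSketch {

    private FragmentStripSketch() {
    }

    // Returns the URL without the '#' character and everything after it.
    // URLs without a fragment are returned unchanged.
    static String stripFragment(String url) {
        int hashIndex = url.indexOf('#');
        return hashIndex < 0 ? url : url.substring(0, hashIndex);
    }

    public static void main(String[] args) {
        // The extractor is described as preserving the fragment ...
        String extracted = "http://example.com/page.html#section2";
        // ... and a later normalization step may remove it.
        System.out.println(stripFragment(extracted));
    }
}
```

In this sketch the extracted value keeps `#section2`, and only the separate normalization step drops it, which mirrors the division of responsibilities described above.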
<extractor class="com.norconex.collector.http.url.impl.TikaLinkExtractor"
    ignoreNofollow="(false|true)" >
  <contentTypes>
    (CSV list of content types on which to perform link extraction.
     leave blank or remove tag to use defaults.)
  </contentTypes>
</extractor>
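As a rough illustration of how the configuration above could be read, the following sketch parses the `ignoreNofollow` attribute and the CSV content of `contentTypes` using only the JDK's XML parser. `ExtractorConfigSketch` and its fields are hypothetical; the actual `loadFromXML` implementation may work differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical holder for the two documented settings; not the Norconex class.
final class ExtractorConfigSketch {
    final boolean ignoreNofollow;
    final List<String> contentTypes;

    ExtractorConfigSketch(boolean ignoreNofollow, List<String> contentTypes) {
        this.ignoreNofollow = ignoreNofollow;
        this.contentTypes = contentTypes;
    }

    static ExtractorConfigSketch parse(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            // A missing attribute yields "", which parses as false.
            boolean ignore = Boolean.parseBoolean(root.getAttribute("ignoreNofollow"));
            // An empty list stands for "use defaults" in this sketch.
            List<String> types = new ArrayList<>();
            NodeList nodes = root.getElementsByTagName("contentTypes");
            if (nodes.getLength() > 0) {
                for (String part : nodes.item(0).getTextContent().split(",")) {
                    String trimmed = part.trim();
                    if (!trimmed.isEmpty()) {
                        types.add(trimmed);
                    }
                }
            }
            return new ExtractorConfigSketch(ignore, types);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration", e);
        }
    }
}
```

A blank or absent `contentTypes` tag produces an empty list here, matching the note that defaults apply in that case.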
See Also: GenericLinkExtractor
Constructor and Description |
---|
TikaLinkExtractor() |
Modifier and Type | Method and Description |
---|---|
boolean | accepts(String url, ContentType contentType): Whether this link extraction should be executed for the given URL and/or content type. |
boolean | equals(Object other) |
Set<Link> | extractLinks(InputStream is, String url, ContentType contentType): Extracts links from a document. |
ContentType[] | getContentTypes() |
int | hashCode() |
boolean | isIgnoreNofollow() |
boolean | isKeepReferrerData(): Deprecated. Since 2.6.0, referrer data is always kept. |
void | loadFromXML(Reader in) |
void | saveToXML(Writer out) |
void | setContentTypes(ContentType... contentTypes) |
void | setIgnoreNofollow(boolean ignoreNofollow) |
void | setKeepReferrerData(boolean keepReferrerData): Deprecated. Since 2.6.0, referrer data is always kept. |
String | toString() |
public Set<Link> extractLinks(InputStream is, String url, ContentType contentType) throws IOException
Extracts links from a document.
Specified by: extractLinks in interface ILinkExtractor
Parameters:
is - the document input stream
url - document reference (URL)
contentType - the document content type
Throws:
IOException - problem reading the document

public ContentType[] getContentTypes()

public void setContentTypes(ContentType... contentTypes)

public boolean isIgnoreNofollow()

public void setIgnoreNofollow(boolean ignoreNofollow)

@Deprecated public boolean isKeepReferrerData()
Deprecated. Since 2.6.0, referrer data is always kept.
Returns:
true

@Deprecated public void setKeepReferrerData(boolean keepReferrerData)
Deprecated. Since 2.6.0, referrer data is always kept.
Parameters:
keepReferrerData - referrer data

public boolean accepts(String url, ContentType contentType)
Whether this link extraction should be executed for the given URL and/or content type.
Specified by: accepts in interface ILinkExtractor
Parameters:
url - the url
contentType - the content type
Returns:
true if the given URL is accepted

public void loadFromXML(Reader in)
Specified by: loadFromXML in interface IXMLConfigurable

public void saveToXML(Writer out) throws IOException
Specified by: saveToXML in interface IXMLConfigurable
Throws:
IOException
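The method signatures above suggest a call sequence in which `accepts` is checked before `extractLinks` is invoked. The sketch below models that flow with stand-in types so it can run without the Norconex library: `LinkExtractorSketch`, `RegexLinkExtractorSketch`, and `ExtractionFlowSketch` are hypothetical, content types and links are plain strings instead of `ContentType` and `Link`, and the toy implementation uses a regular expression rather than Apache Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in mirroring the shape of the documented methods (assumption:
// String replaces ContentType and Link to keep the sketch self-contained).
interface LinkExtractorSketch {
    boolean accepts(String url, String contentType);

    Set<String> extractLinks(InputStream is, String url, String contentType)
            throws IOException;
}

// Toy implementation based on a regular expression, NOT on Apache Tika.
final class RegexLinkExtractorSketch implements LinkExtractorSketch {
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    @Override
    public boolean accepts(String url, String contentType) {
        // Only HTML documents are handled in this sketch.
        return "text/html".equals(contentType);
    }

    @Override
    public Set<String> extractLinks(InputStream is, String url, String contentType)
            throws IOException {
        String html = new String(is.readAllBytes(), StandardCharsets.UTF_8);
        Set<String> links = new LinkedHashSet<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            // Fragments are kept as found; stripping is left to a later step.
            links.add(matcher.group(1));
        }
        return links;
    }
}

final class ExtractionFlowSketch {
    private ExtractionFlowSketch() {
    }

    // Checks accepts() first, then extracts links from the document content.
    static Set<String> run(LinkExtractorSketch extractor, String url,
            String contentType, String html) {
        if (!extractor.accepts(url, contentType)) {
            return Collections.emptySet();
        }
        try (InputStream is = new ByteArrayInputStream(
                html.getBytes(StandardCharsets.UTF_8))) {
            return extractor.extractLinks(is, url, contentType);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the real classes, the same flow would presumably pass a `ContentType` value and receive a `Set<Link>`, as the signatures above indicate.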
Copyright © 2009–2021 Norconex Inc. All rights reserved.