public class TikaLinkExtractor extends Object implements ILinkExtractor, IXMLConfigurable
Implementation of ILinkExtractor using
Apache Tika to perform URL
extractions from HTML documents.
This is an alternative to the GenericLinkExtractor.
The configuration of content-types, storing the referrer data, and ignoring
"nofollow" are the same
as in GenericLinkExtractor.
Since 2.3.0, this extractor preserves the hash character (#) found
in URLs and every character after it. It relies on the IURLNormalizer
implementation to strip the fragment if need be. Also since 2.3.0,
GenericURLNormalizer is always invoked by default, and its
default set of rules removes fragments.
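To make the fragment behavior concrete, here is a minimal sketch of what stripping a fragment amounts to. It uses only plain string handling and is not the GenericURLNormalizer implementation; the class and method names (`FragmentSketch`, `stripFragment`) and the example URL are hypothetical.

```java
// Illustrative only: shows the effect of removing a URL fragment.
// This is NOT how GenericURLNormalizer is implemented.
class FragmentSketch {

    // Returns the URL without its fragment, i.e. without the first '#'
    // and every character after it. URLs without '#' are returned as-is.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        // What the extractor would hand over (fragment preserved).
        String extracted = "http://example.com/page.html#section-2";
        // What a fragment-removing normalization rule would leave.
        System.out.println(stripFragment(extracted));
    }
}
```

A crawler that needs to keep fragments would have to configure the normalizer accordingly, since the extractor itself no longer removes them.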
<extractor class="com.norconex.collector.http.url.impl.TikaLinkExtractor"
ignoreNofollow="(false|true)" >
<contentTypes>
(CSV list of content types on which to perform link extraction.
leave blank or remove tag to use defaults.)
</contentTypes>
</extractor>
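The following sketch shows how the two settings in the XML above (the `ignoreNofollow` attribute and the comma-separated `contentTypes` list) could be read. It uses the JDK DOM parser rather than the class's own `loadFromXML`, whose internals are not described here; `ExtractorConfigSketch` and the sample content types are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical holder for the two documented settings.
class ExtractorConfigSketch {
    final boolean ignoreNofollow;
    final List<String> contentTypes;

    ExtractorConfigSketch(boolean ignoreNofollow, List<String> contentTypes) {
        this.ignoreNofollow = ignoreNofollow;
        this.contentTypes = contentTypes;
    }

    // Parses an <extractor> element given as a string.
    static ExtractorConfigSketch parse(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            // A missing attribute yields "", which parses to false.
            boolean ignore =
                    Boolean.parseBoolean(root.getAttribute("ignoreNofollow"));
            // An empty list stands for "use defaults" (blank or absent tag).
            List<String> types = new ArrayList<>();
            NodeList nodes = root.getElementsByTagName("contentTypes");
            if (nodes.getLength() > 0) {
                for (String t : nodes.item(0).getTextContent().split(",")) {
                    if (!t.trim().isEmpty()) {
                        types.add(t.trim());
                    }
                }
            }
            return new ExtractorConfigSketch(ignore, types);
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid extractor XML", e);
        }
    }

    public static void main(String[] args) {
        ExtractorConfigSketch c = parse(
                "<extractor ignoreNofollow=\"true\">"
              + "<contentTypes>text/html, application/xhtml+xml</contentTypes>"
              + "</extractor>");
        System.out.println(c.ignoreNofollow + " " + c.contentTypes);
    }
}
```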
See Also: GenericLinkExtractor

| Constructor and Description |
|---|
| TikaLinkExtractor() |

| Modifier and Type | Method and Description |
|---|---|
| boolean | accepts(String url, ContentType contentType): Whether this link extraction should be executed for the given URL and/or content type. |
| boolean | equals(Object other) |
| Set<Link> | extractLinks(InputStream is, String url, ContentType contentType): Extracts links from a document. |
| ContentType[] | getContentTypes() |
| int | hashCode() |
| boolean | isIgnoreNofollow() |
| boolean | isKeepReferrerData(): Deprecated. Since 2.6.0, referrer data is always kept. |
| void | loadFromXML(Reader in) |
| void | saveToXML(Writer out) |
| void | setContentTypes(ContentType... contentTypes) |
| void | setIgnoreNofollow(boolean ignoreNofollow) |
| void | setKeepReferrerData(boolean keepReferrerData): Deprecated. Since 2.6.0, referrer data is always kept. |
| String | toString() |

public Set<Link> extractLinks(InputStream is, String url, ContentType contentType) throws IOException
Description copied from interface: ILinkExtractor
Extracts links from a document.
Specified by: extractLinks in interface ILinkExtractor
Parameters:
is - the document input stream
url - document reference (URL)
contentType - the document content type
Throws:
IOException - problem reading the document

public ContentType[] getContentTypes()

public void setContentTypes(ContentType... contentTypes)

public boolean isIgnoreNofollow()

public void setIgnoreNofollow(boolean ignoreNofollow)

@Deprecated public boolean isKeepReferrerData()
Deprecated. Since 2.6.0, referrer data is always kept.
Returns: true

@Deprecated public void setKeepReferrerData(boolean keepReferrerData)
Deprecated. Since 2.6.0, referrer data is always kept.
Parameters:
keepReferrerData - referrer data

public boolean accepts(String url, ContentType contentType)
Description copied from interface: ILinkExtractor
Whether this link extraction should be executed for the given URL and/or content type.
Specified by: accepts in interface ILinkExtractor
Parameters:
url - the url
contentType - the content type
Returns: true if the given URL is accepted

public void loadFromXML(Reader in)
Specified by: loadFromXML in interface IXMLConfigurable

public void saveToXML(Writer out) throws IOException
Specified by: saveToXML in interface IXMLConfigurable
Throws:
IOException

Copyright © 2009–2021 Norconex Inc. All rights reserved.