public class GenericCanonicalLinkDetector extends Object implements ICanonicalLinkDetector, IXMLConfigurable
Generic canonical link detector. It detects canonical links from HTTP headers as well as from HTML files. Good reference documentation on canonical links can be found in the Google Webmaster Tools help pages.
This detector will look for a metadata field (normally obtained from the HTTP headers) named "Link" with a value following this pattern:
<http://www.example.com/sample.pdf> rel="canonical"
All documents are checked for a canonical link in their metadata, not just HTML documents.
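As an illustration of the header pattern above, here is a minimal sketch of how a "Link" value could be matched with a regular expression. It is not the detector's actual implementation: the class name `LinkHeaderCanonicalSketch`, the `detect` method, and the regular expression are all assumptions made for this example. The sketch also tolerates an optional `;` between the URL and `rel`, since standard Link headers use that separator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Norconex API.
final class LinkHeaderCanonicalSketch {

    // <URL>, an optional ';' separator, then rel="canonical" (quotes optional).
    private static final Pattern CANONICAL = Pattern.compile(
            "<([^>]+)>\\s*;?\\s*rel\\s*=\\s*\"?canonical\"?",
            Pattern.CASE_INSENSITIVE);

    /** Returns the canonical URL found in a "Link" header value, or null. */
    static String detect(String linkHeaderValue) {
        if (linkHeaderValue == null) {
            return null;
        }
        Matcher m = CANONICAL.matcher(linkHeaderValue);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Value in the form shown in the documentation above.
        System.out.println(detect(
                "<http://www.example.com/sample.pdf> rel=\"canonical\""));
        // A link with a different relation yields no canonical URL.
        System.out.println(detect(
                "<http://www.example.com/next>; rel=\"next\""));
    }
}
```

A regular expression keeps the sketch short. A production parser would more likely split the header into individual link values and parse each parameter separately.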
This detector will look within the HTML <head> tags for a <link> tag following this pattern:
<link rel="canonical" href="https://www.example.com/sample" />
Only HTML documents are checked for a canonical link in their content. By default, these content types are considered HTML:
text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
You can specify your own content types as long as they contain HTML text.
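The content-based detection described above can be sketched as follows: check the content type first, then look inside `<head>` for a `<link>` tag whose `rel` is `canonical`. This is an illustrative approximation only, not the detector's real code. The class `HtmlCanonicalSketch`, its `detect` method, and the regular expressions are assumptions, and content types are handled as plain strings rather than `ContentType` objects.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Norconex API.
final class HtmlCanonicalSketch {

    // Default content types as listed in the documentation above.
    static final List<String> DEFAULT_TYPES = Arrays.asList(
            "text/html", "application/xhtml+xml",
            "vnd.wap.xhtml+xml", "x-asp");

    private static final Pattern HEAD = Pattern.compile(
            "<head\\b[^>]*>(.*?)</head>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern LINK_TAG = Pattern.compile(
            "<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_CANONICAL = Pattern.compile(
            "\\brel\\s*=\\s*[\"']canonical[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the canonical URL declared in the HTML head, or null if the
     * content type is not accepted or no canonical link is present.
     * The content type is expected as a bare media type (no charset).
     */
    static String detect(String html, String contentType, List<String> types) {
        if (html == null || !types.contains(contentType)) {
            return null;
        }
        Matcher head = HEAD.matcher(html);
        if (!head.find()) {
            return null;
        }
        // Inspect each <link> tag inside <head> until a canonical one is found.
        Matcher link = LINK_TAG.matcher(head.group(1));
        while (link.find()) {
            String tag = link.group();
            if (REL_CANONICAL.matcher(tag).find()) {
                Matcher href = HREF.matcher(tag);
                if (href.find()) {
                    return href.group(1).trim();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<html><head><link rel=\"canonical\" "
                + "href=\"https://www.example.com/sample\" /></head></html>";
        System.out.println(detect(html, "text/html", DEFAULT_TYPES));
    }
}
```

Passing the accepted types as a parameter mirrors the idea that the defaults can be replaced with custom content types, as long as the documents contain HTML text.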
<canonicalLinkDetector
    class="com.norconex.collector.http.url.impl.GenericCanonicalLinkDetector"
    ignore="(false|true)">
  <contentTypes>
    (CSV list of content types on which to perform canonical link
    detection. Leave blank or remove this tag to use defaults.)
  </contentTypes>
</canonicalLinkDetector>
The following example ignores canonical link resolution.
<canonicalLinkDetector ignore="true"/>
Constructor and Description |
---|
GenericCanonicalLinkDetector() |
Modifier and Type | Method and Description |
---|---|
String | detectFromContent(String reference, InputStream is, ContentType contentType) Detects the presence of a canonical URL in a document's content. |
String | detectFromMetadata(String reference, HttpMetadata metadata) Detects a canonical URL from the metadata gathered so far, which at invocation time is normally the HTTP header values. |
boolean | equals(Object other) |
ContentType[] | getContentTypes() |
int | hashCode() |
void | loadFromXML(Reader in) |
void | saveToXML(Writer out) |
void | setContentTypes(ContentType... contentTypes) |
String | toString() |
public ContentType[] getContentTypes()
public void setContentTypes(ContentType... contentTypes)
public String detectFromMetadata(String reference, HttpMetadata metadata)
Specified by: detectFromMetadata in interface ICanonicalLinkDetector
Parameters:
reference - document reference
metadata - metadata object containing HTTP headers
Returns: the canonical URL, or null if none is found.
public String detectFromContent(String reference, InputStream is, ContentType contentType) throws IOException
Specified by: detectFromContent in interface ICanonicalLinkDetector
Parameters:
reference - document reference
is - the document content input stream
contentType - the document content type
Returns: the canonical URL, or null if none is found.
Throws: IOException - problem reading content
public void loadFromXML(Reader in) throws IOException
Specified by: loadFromXML in interface IXMLConfigurable
Throws: IOException
public void saveToXML(Writer out) throws IOException
Specified by: saveToXML in interface IXMLConfigurable
Throws: IOException
Copyright © 2009–2021 Norconex Inc. All rights reserved.