public class GenericCanonicalLinkDetector extends Object implements ICanonicalLinkDetector, IXMLConfigurable
Generic canonical link detector. It detects canonical links from HTTP headers as well as from HTML files. Good reference documentation on canonical links can be found on this Google Webmaster Tools help page.
This detector will look for a metadata field (normally obtained from the HTTP headers) named "Link" with a value following this pattern:
<http://www.example.com/sample.pdf> rel="canonical"
All documents (not just HTML) are checked for a canonical link in this way.
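The header check described above can be sketched as follows. This is only an illustration, not the detector's actual implementation: the class name `LinkHeaderCanonicalSketch`, the helper `findCanonical`, and the regular expression are all assumptions made for this example. The pattern accepts the form shown above and also tolerates a semicolon between the URL and the `rel` parameter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not part of the Norconex API.
class LinkHeaderCanonicalSketch {

    // Captures the URL between angle brackets when the same link entry
    // (no comma crossed) carries rel="canonical" (quotes optional).
    private static final Pattern CANONICAL = Pattern.compile(
            "<([^>]+)>[^,]*?\\brel\\s*=\\s*\"?canonical\\b\"?",
            Pattern.CASE_INSENSITIVE);

    // Returns the first canonical URL found in the given "Link" header
    // values, or null if none is found.
    static String findCanonical(List<String> linkHeaderValues) {
        for (String value : linkHeaderValues) {
            Matcher m = CANONICAL.matcher(value);
            if (m.find()) {
                return m.group(1).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "<http://www.example.com/sample.pdf> rel=\"canonical\"");
        System.out.println(findCanonical(values));
    }
}
```

Returning `null` when nothing matches mirrors the documented contract of `detectFromMetadata`.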
This detector will look within the HTML <head> tags for a <link> tag following this pattern:
<link rel="canonical" href="https://www.example.com/sample" />
Only HTML documents are checked for a canonical link in their content. By default, these content types are considered HTML:
text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
You can specify your own content types as long as they contain HTML text.
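As a rough illustration of the content check described above, the sketch below only inspects documents whose content type is in the default list, then looks inside `<head>` for a `<link>` tag with `rel="canonical"`. It is a simplified, regex-based approximation under stated assumptions: the class `HtmlCanonicalSketch`, the helper `findCanonical`, and every pattern are hypothetical and do not reflect how the detector is actually implemented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not part of the Norconex API.
class HtmlCanonicalSketch {

    // Default content types listed in the documentation above.
    static final Set<String> DEFAULT_TYPES = new HashSet<>(Arrays.asList(
            "text/html", "application/xhtml+xml",
            "vnd.wap.xhtml+xml", "x-asp"));

    private static final int CI = Pattern.CASE_INSENSITIVE;
    private static final Pattern HEAD = Pattern.compile(
            "<head\\b[^>]*>(.*?)</head\\s*>", CI | Pattern.DOTALL);
    private static final Pattern LINK = Pattern.compile("<link\\b[^>]*>", CI);
    private static final Pattern REL = Pattern.compile(
            "\\brel\\s*=\\s*[\"']canonical[\"']", CI);
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", CI);

    // Returns the canonical URL declared in the HTML head,
    // or null if the type is not supported or no link is found.
    static String findCanonical(String html, String contentType) {
        if (contentType == null || !DEFAULT_TYPES.contains(
                contentType.trim().toLowerCase(Locale.ROOT))) {
            return null; // non-HTML content is skipped
        }
        Matcher head = HEAD.matcher(html);
        if (!head.find()) {
            return null;
        }
        Matcher link = LINK.matcher(head.group(1));
        while (link.find()) {
            String tag = link.group();
            if (REL.matcher(tag).find()) {
                Matcher href = HREF.matcher(tag);
                if (href.find()) {
                    return href.group(1).trim();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<html><head><link rel=\"canonical\" "
                + "href=\"https://www.example.com/sample\" /></head></html>";
        System.out.println(findCanonical(html, "text/html"));
    }
}
```

A real HTML parser would be more robust than regular expressions; the regex approach is used here only to keep the example self-contained.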
<canonicalLinkDetector
    class="com.norconex.collector.http.url.impl.GenericCanonicalLinkDetector"
    ignore="(false|true)">
  <contentTypes>
    (CSV list of content types on which to perform canonical link
    detection. Leave blank or remove this tag to use defaults.)
  </contentTypes>
</canonicalLinkDetector>
The following example ignores canonical link resolution.
<canonicalLinkDetector ignore="true"/>
| Constructor and Description |
|---|
| GenericCanonicalLinkDetector() |

| Modifier and Type | Method and Description |
|---|---|
| String | detectFromContent(String reference, InputStream is, ContentType contentType) Detects the presence of a canonical URL in a document's content. |
| String | detectFromMetadata(String reference, HttpMetadata metadata) Detects a canonical URL from the metadata gathered so far, which at the time of invocation is normally the HTTP header values. |
| boolean | equals(Object other) |
| ContentType[] | getContentTypes() |
| int | hashCode() |
| void | loadFromXML(Reader in) |
| void | saveToXML(Writer out) |
| void | setContentTypes(ContentType... contentTypes) |
| String | toString() |
public ContentType[] getContentTypes()

public void setContentTypes(ContentType... contentTypes)

public String detectFromMetadata(String reference, HttpMetadata metadata)
Specified by: detectFromMetadata in interface ICanonicalLinkDetector
Parameters:
reference - document reference
metadata - metadata object containing HTTP headers
Returns: the canonical URL, or null if none is found.

public String detectFromContent(String reference, InputStream is, ContentType contentType) throws IOException
Specified by: detectFromContent in interface ICanonicalLinkDetector
Parameters:
reference - document reference
is - the document content input stream
contentType - the document content type
Returns: the canonical URL, or null if none is found.
Throws: IOException - problem reading content

public void loadFromXML(Reader in) throws IOException
Specified by: loadFromXML in interface IXMLConfigurable
Throws: IOException

public void saveToXML(Writer out) throws IOException
Specified by: saveToXML in interface IXMLConfigurable
Throws: IOException

Copyright © 2009–2021 Norconex Inc. All rights reserved.