public class GenericCanonicalLinkDetector extends Object implements ICanonicalLinkDetector, IXMLConfigurable
Generic canonical link detector. It detects canonical links in HTTP headers as well as in HTML files. Good reference documentation on canonical links can be found on the Google Webmaster Tools help pages.
This detector will look for a metadata field (normally obtained from the HTTP headers) named "Link" with a value following this pattern:
<http://www.example.com/sample.pdf> rel="canonical"
All documents (not just HTML) are checked for a canonical link in this metadata field.
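As a rough illustration of the header check described above, the sketch below pulls a canonical URL out of a "Link" header value using only java.util.regex. It is not this class's actual implementation: the names LinkHeaderSketch and canonicalFromLinkHeader are hypothetical, it returns an Optional rather than null, and the parsing is deliberately simplified (it assumes parameter values contain no commas).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Norconex API.
final class LinkHeaderSketch {

    // One link-value: <URL> followed by its parameters, up to the next comma.
    private static final Pattern LINK_VALUE =
            Pattern.compile("<([^>]*)>([^,]*)");

    // rel=canonical, quoted or unquoted, case-insensitive.
    private static final Pattern REL_CANONICAL = Pattern.compile(
            "\\brel\\s*=\\s*\"?canonical\\b", Pattern.CASE_INSENSITIVE);

    private LinkHeaderSketch() {
    }

    /** Returns the first URL whose parameters declare rel="canonical". */
    static Optional<String> canonicalFromLinkHeader(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = LINK_VALUE.matcher(headerValue);
        while (m.find()) {
            // group(1) is the URL, group(2) the parameters that follow it.
            if (REL_CANONICAL.matcher(m.group(2)).find()) {
                return Optional.of(m.group(1).trim());
            }
        }
        return Optional.empty();
    }
}
```

Because the sketch scans every link-value in turn, a header that lists several relations (for example a rel="next" entry before the canonical one) still yields the canonical URL.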
This detector will look within the HTML <head> tags for a <link> tag following this pattern:
<link rel="canonical" href="https://www.example.com/sample" />
Only HTML documents are checked for a canonical link in their content. By default, these content types are considered HTML:
text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
You can specify your own content types as long as they contain HTML text.
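To make the content-based check more concrete, here is a minimal sketch that first gates on the default content types listed above and then looks for a canonical link tag inside the head element. It is an assumption-laden illustration, not the detector's real code: HtmlCanonicalSketch and canonicalFromHtml are hypothetical names, the content type is expected as a bare media type without parameters such as charset, and regular expressions stand in for a proper HTML parser.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Norconex API.
final class HtmlCanonicalSketch {

    // Default content types as listed in the documentation above.
    static final Set<String> DEFAULT_HTML_TYPES = Set.of(
            "text/html", "application/xhtml+xml",
            "vnd.wap.xhtml+xml", "x-asp");

    private static final Pattern HEAD = Pattern.compile(
            "<head\\b[^>]*>(.*?)</head\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern LINK_TAG = Pattern.compile(
            "<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_CANONICAL = Pattern.compile(
            "\\brel\\s*=\\s*[\"']?canonical\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    private HtmlCanonicalSketch() {
    }

    /** Returns the canonical href found in the head, if any. */
    static Optional<String> canonicalFromHtml(String contentType, String html) {
        if (contentType == null || html == null) {
            return Optional.empty();
        }
        // Skip documents whose content type is not considered HTML.
        if (!DEFAULT_HTML_TYPES.contains(contentType.toLowerCase(Locale.ROOT))) {
            return Optional.empty();
        }
        Matcher head = HEAD.matcher(html);
        if (!head.find()) {
            return Optional.empty();
        }
        // Only link tags located inside the head element are considered.
        Matcher link = LINK_TAG.matcher(head.group(1));
        while (link.find()) {
            String tag = link.group();
            Matcher href = HREF.matcher(tag);
            if (REL_CANONICAL.matcher(tag).find() && href.find()) {
                return Optional.of(href.group(1).trim());
            }
        }
        return Optional.empty();
    }
}
```

Checking the content type before scanning means non-HTML documents, such as PDFs, are never searched for a link tag.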
<canonicalLinkDetector
    class="com.norconex.collector.http.canon.impl.GenericCanonicalLinkDetector"
    ignore="(false|true)">
  <contentTypes>
    (CSV list of content types on which to perform canonical link
    detection. Leave blank or remove this tag to use defaults.)
  </contentTypes>
</canonicalLinkDetector>
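The contentTypes element above takes a comma-separated list and falls back to the defaults when left blank. The following sketch shows one way such a value could be interpreted. It is only an assumption about the behavior described, not the actual loadFromXML logic, and ContentTypesCsvSketch and parseContentTypes are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Norconex API.
final class ContentTypesCsvSketch {

    // Defaults as listed in the class documentation.
    static final List<String> DEFAULTS = List.of(
            "text/html", "application/xhtml+xml",
            "vnd.wap.xhtml+xml", "x-asp");

    private ContentTypesCsvSketch() {
    }

    /** Splits a CSV value into content types, or returns the defaults. */
    static List<String> parseContentTypes(String csv) {
        // A missing or blank value means "use the defaults".
        if (csv == null || csv.trim().isEmpty()) {
            return DEFAULTS;
        }
        List<String> types = new ArrayList<>();
        for (String token : csv.split(",")) {
            String type = token.trim();
            if (!type.isEmpty()) {
                types.add(type);
            }
        }
        // A value made only of commas and whitespace also yields the defaults.
        return types.isEmpty() ? DEFAULTS : types;
    }
}
```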
<canonicalLinkDetector ignore="true"/>
The above example ignores canonical link resolution.
Constructor and Description |
---|
GenericCanonicalLinkDetector() |
Modifier and Type | Method and Description |
---|---|
String | detectFromContent(String reference, InputStream is, ContentType contentType): Detects the presence of a canonical URL in a document's content. |
String | detectFromMetadata(String reference, Properties metadata): Detects a canonical URL from the metadata gathered so far, which at the time of invocation is normally the HTTP header values. |
boolean | equals(Object other) |
List<ContentType> | getContentTypes() |
int | hashCode() |
void | loadFromXML(XML xml) |
void | saveToXML(XML xml) |
void | setContentTypes(ContentType... contentTypes): Sets the content types on which to perform canonical link detection. |
void | setContentTypes(List<ContentType> contentTypes): Sets the content types on which to perform canonical link detection. |
String | toString() |
public List<ContentType> getContentTypes()

public void setContentTypes(ContentType... contentTypes)
Parameters:
contentTypes - content types

public void setContentTypes(List<ContentType> contentTypes)
Parameters:
contentTypes - content types

public String detectFromMetadata(String reference, Properties metadata)
Specified by:
detectFromMetadata in interface ICanonicalLinkDetector
Parameters:
reference - document reference
metadata - metadata object containing HTTP headers
Returns:
the canonical URL, or null if none is found.

public String detectFromContent(String reference, InputStream is, ContentType contentType) throws IOException
Specified by:
detectFromContent in interface ICanonicalLinkDetector
Parameters:
reference - document reference
is - the document content input stream
contentType - the document content type
Returns:
the canonical URL, or null if none is found.
Throws:
IOException - problem reading content

public void loadFromXML(XML xml)
Specified by:
loadFromXML in interface IXMLConfigurable

public void saveToXML(XML xml)
Specified by:
saveToXML in interface IXMLConfigurable
Copyright © 2009–2023 Norconex Inc. All rights reserved.