Class GenericCanonicalLinkDetector
- All Implemented Interfaces:
ICanonicalLinkDetector, IXMLConfigurable
Generic canonical link detector. It detects canonical links from HTTP headers as well as from HTML content. Good reference documentation on canonical links can be found on this Google Webmaster Tools help page.
HTTP Headers
This detector looks for a metadata field named "Link" (normally obtained from the HTTP headers) with a value following this pattern:
<http://www.example.com/sample.pdf> rel="canonical"
All documents will be verified for a canonical link (not just HTML).
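As an illustration of the header pattern above, here is a minimal sketch of extracting the canonical URL from a "Link" value. It is not this class's actual implementation: the class name LinkHeaderCanonicalSketch, the method canonicalFromLinkHeader, and the regular expression are all assumptions made for the example. The sketch also tolerates the ";" separator commonly seen between the URL and the rel parameter.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of the library).
final class LinkHeaderCanonicalSketch {

    // <URL>, an optional ";", then rel=canonical with optional double quotes.
    private static final Pattern CANONICAL = Pattern.compile(
            "<\\s*([^>]+?)\\s*>\\s*;?\\s*rel\\s*=\\s*\"?canonical\"?",
            Pattern.CASE_INSENSITIVE);

    // Returns the URL between angle brackets when rel is "canonical".
    static Optional<String> canonicalFromLinkHeader(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = CANONICAL.matcher(headerValue);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String value = "<http://www.example.com/sample.pdf> rel=\"canonical\"";
        System.out.println(canonicalFromLinkHeader(value).orElse("(none)"));
    }
}
```

Because the match is applied to the header value alone, the same check works for any document type, which is consistent with header detection applying to all documents.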
Document content
This detector will look within the HTML <head> tags for a <link> tag following this pattern:
<link rel="canonical" href="https://www.example.com/sample" />
Only HTML documents will be verified for a canonical link. By default, these content-types are considered HTML:
text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
You can specify your own content types as long as they contain HTML text.
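The content-based detection described above can be sketched roughly as follows. This is a simplified illustration and not the class's actual implementation: the class name HtmlCanonicalSketch, the method canonicalFromHtml, and the regular expressions are assumptions, and a production detector would more likely rely on a proper HTML parser. The default content types are copied from the list above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of the library).
final class HtmlCanonicalSketch {

    // Default content types treated as HTML, as listed in this page.
    static final List<String> HTML_TYPES = Arrays.asList(
            "text/html", "application/xhtml+xml", "vnd.wap.xhtml+xml", "x-asp");

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern HEAD =
            Pattern.compile("<head\\b[^>]*>(.*?)</head\\s*>", FLAGS);
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", FLAGS);
    private static final Pattern REL =
            Pattern.compile("\\brel\\s*=\\s*([\"'])\\s*canonical\\s*\\1", FLAGS);
    private static final Pattern HREF =
            Pattern.compile("\\bhref\\s*=\\s*([\"'])(.*?)\\1", FLAGS);

    static Optional<String> canonicalFromHtml(String html, String contentType) {
        if (html == null || contentType == null) {
            return Optional.empty();
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (!HTML_TYPES.contains(type)) {
            return Optional.empty(); // only HTML documents are checked
        }
        Matcher head = HEAD.matcher(html);
        if (!head.find()) {
            return Optional.empty(); // only <link> tags inside <head> count
        }
        Matcher link = LINK_TAG.matcher(head.group(1));
        while (link.find()) {
            String tag = link.group();
            if (REL.matcher(tag).find()) {
                Matcher href = HREF.matcher(tag);
                if (href.find()) {
                    return Optional.of(href.group(2).trim());
                }
            }
        }
        return Optional.empty();
    }
}
```

For example, calling canonicalFromHtml with a page whose head contains the link tag shown above and the content type "text/html" yields the href value, while the same page with "application/pdf" yields an empty result.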
XML configuration usage:
<canonicalLinkDetector
    class="com.norconex.collector.http.canon.impl.GenericCanonicalLinkDetector"
    ignore="(false|true)">
  <contentTypes>
    (CSV list of content types on which to perform canonical link
    detection. Leave blank or remove this tag to use defaults.)
  </contentTypes>
</canonicalLinkDetector>
XML usage example:
<canonicalLinkDetector ignore="true"/>
The above example ignores canonical link resolution.
- Since:
- 2.2.0
- Author:
- Pascal Essiembre
Constructor Summary
Constructors
GenericCanonicalLinkDetector()
Method Summary
Modifier and Type, Method, Description
- String detectFromContent(String reference, InputStream is, ContentType contentType): Detects from a document content the presence of a canonical URL.
- detectFromMetadata(String reference, Properties metadata): Detects from metadata gathered so far, which when invoked, is normally the HTTP header values.
- boolean equals
- getContentTypes
- int hashCode()
- void loadFromXML(XML xml)
- void saveToXML
- void setContentTypes(ContentType... contentTypes): Sets the content types on which to perform canonical link detection.
- void setContentTypes(List<ContentType> contentTypes): Sets the content types on which to perform canonical link detection.
- toString()
Constructor Details
GenericCanonicalLinkDetector
public GenericCanonicalLinkDetector()
Method Details
getContentTypes
setContentTypes
Sets the content types on which to perform canonical link detection.
- Parameters:
contentTypes - content types
setContentTypes
Sets the content types on which to perform canonical link detection.
- Parameters:
contentTypes - content types
- Since:
- 3.0.0
detectFromMetadata
Description copied from interface: ICanonicalLinkDetector
Detects from metadata gathered so far, which when invoked, is normally the HTTP header values.
- Specified by:
detectFromMetadata in interface ICanonicalLinkDetector
- Parameters:
reference - document reference
metadata - metadata object containing HTTP headers
- Returns:
- the detected canonical URL or null if none is found.
detectFromContent
public String detectFromContent(String reference, InputStream is, ContentType contentType) throws IOException
Description copied from interface: ICanonicalLinkDetector
Detects the presence of a canonical URL in a document's content. This occurs before a document gets parsed and may apply to only a few content types.
- Specified by:
detectFromContent in interface ICanonicalLinkDetector
- Parameters:
reference - document reference
is - the document content input stream
contentType - the document content type
- Returns:
- the detected canonical URL or null if none is found.
- Throws:
IOException - problem reading content
loadFromXML
- Specified by:
loadFromXML in interface IXMLConfigurable
saveToXML
- Specified by:
saveToXML in interface IXMLConfigurable
equals
hashCode
public int hashCode()
toString