Class GenericCanonicalLinkDetector

java.lang.Object
com.norconex.collector.http.canon.impl.GenericCanonicalLinkDetector
All Implemented Interfaces:
ICanonicalLinkDetector, IXMLConfigurable

public class GenericCanonicalLinkDetector extends Object implements ICanonicalLinkDetector, IXMLConfigurable

Generic canonical link detector. It detects canonical links from HTTP headers as well as from HTML content. Good reference documentation on canonical links can be found in the Google Webmaster Tools help pages.

HTTP Headers

This detector looks for a metadata field named "Link" (normally obtained from the HTTP headers) with a value following this pattern:

 <http://www.example.com/sample.pdf> rel="canonical"
 

All documents are checked for a canonical link in their metadata, not just HTML documents.
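
The header matching described above can be illustrated with a minimal, JDK-only sketch. It is not the detector's actual implementation: the class `LinkHeaderCanonicalSketch` and its method `findCanonical` are hypothetical names introduced here for illustration. The sketch assumes the header value holds one or more `<url>` entries, each followed by its parameters, and it accepts the pattern shown above as well as the semicolon-separated form commonly seen in Link headers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the Norconex API. */
final class LinkHeaderCanonicalSketch {

    // One link entry: <url>, an optional ';', then parameters up to the next '<'.
    private static final Pattern LINK_ENTRY =
            Pattern.compile("<([^>]*)>\\s*;?([^<]*)");

    // rel=canonical, with or without double quotes (single-valued rel only).
    private static final Pattern REL_CANONICAL = Pattern.compile(
            "\\brel\\s*=\\s*\"?canonical\\b\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the first canonical URL in the header value, or null if none. */
    public static String findCanonical(String linkHeaderValue) {
        if (linkHeaderValue == null) {
            return null;
        }
        Matcher entry = LINK_ENTRY.matcher(linkHeaderValue);
        while (entry.find()) {
            // group(2) holds the parameters that follow the bracketed URL.
            if (REL_CANONICAL.matcher(entry.group(2)).find()) {
                String url = entry.group(1).trim();
                return url.isEmpty() ? null : url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(findCanonical(
                "<http://www.example.com/sample.pdf> rel=\"canonical\""));
        System.out.println(findCanonical(
                "<https://example.com/p2>; rel=\"next\", "
                + "<https://example.com/p>; rel=\"canonical\""));
    }
}
```

Like the detector's own methods, the sketch returns null when no canonical link is present. It does not handle multi-valued `rel` parameters such as `rel="alternate canonical"`.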

Document content

This detector looks within the HTML <head> element for a <link> tag following this pattern:

 <link rel="canonical" href="https://www.example.com/sample" />
 

Only HTML documents are checked for a canonical link in their content. By default, these content types are considered HTML:

 text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
 

You can specify your own content types as long as they contain HTML text.
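
The content-based detection described above can be sketched with the JDK alone. This is an illustration under stated assumptions, not the detector's actual code: the class `HtmlCanonicalSketch`, its method `findCanonical`, and the use of plain strings for the content and content type are all hypothetical. The sketch uses regular expressions for brevity, so it will miss or misread unusual markup that a real HTML parser would handle.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the Norconex API. */
final class HtmlCanonicalSketch {

    // The default HTML content types listed in the documentation above.
    static final Set<String> DEFAULT_HTML_TYPES = Set.of(
            "text/html", "application/xhtml+xml", "vnd.wap.xhtml+xml", "x-asp");

    private static final Pattern HEAD = Pattern.compile(
            "<head\\b[^>]*>(.*?)</head\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern LINK_TAG = Pattern.compile(
            "<link\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_CANONICAL = Pattern.compile(
            "\\brel\\s*=\\s*([\"'])canonical\\1", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns the canonical URL found in the head section, or null if none. */
    public static String findCanonical(String html, String contentType) {
        if (html == null || contentType == null) {
            return null;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (!DEFAULT_HTML_TYPES.contains(type)) {
            return null; // only HTML content types are inspected
        }
        Matcher head = HEAD.matcher(html);
        if (!head.find()) {
            return null;
        }
        Matcher link = LINK_TAG.matcher(head.group(1));
        while (link.find()) {
            String attrs = link.group(1);
            if (REL_CANONICAL.matcher(attrs).find()) {
                Matcher href = HREF.matcher(attrs);
                if (href.find()) {
                    return href.group(2).trim();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<html><head><link rel=\"canonical\" "
                + "href=\"https://www.example.com/sample\" /></head></html>";
        System.out.println(findCanonical(html, "text/html; charset=UTF-8"));
    }
}
```

A `<link>` tag that appears only in the document body is ignored by this sketch, matching the statement above that the detector looks within the head section.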

XML configuration usage:


<canonicalLinkDetector
    class="com.norconex.collector.http.canon.impl.GenericCanonicalLinkDetector"
    ignore="(false|true)">
  <contentTypes>
    (CSV list of content types on which to perform canonical link
    detection. Leave blank or remove this tag to use defaults.)
  </contentTypes>
</canonicalLinkDetector>

XML usage example:


<canonicalLinkDetector
    ignore="true"/>

The above example turns off canonical link resolution, so canonical links are ignored.

Since:
2.2.0
Author:
Pascal Essiembre
  • Constructor Details

    • GenericCanonicalLinkDetector

      public GenericCanonicalLinkDetector()
  • Method Details

    • getContentTypes

      public List<ContentType> getContentTypes()
    • setContentTypes

      public void setContentTypes(ContentType... contentTypes)
      Sets the content types on which to perform canonical link detection.
      Parameters:
      contentTypes - content types
    • setContentTypes

      public void setContentTypes(List<ContentType> contentTypes)
      Sets the content types on which to perform canonical link detection.
      Parameters:
      contentTypes - content types
      Since:
      3.0.0
    • detectFromMetadata

      public String detectFromMetadata(String reference, Properties metadata)
      Description copied from interface: ICanonicalLinkDetector
      Detects a canonical URL from the metadata gathered so far, which at the time of invocation normally consists of the HTTP header values.
      Specified by:
      detectFromMetadata in interface ICanonicalLinkDetector
      Parameters:
      reference - document reference
      metadata - metadata object containing HTTP headers
      Returns:
      the detected canonical URL or null if none is found.
    • detectFromContent

      public String detectFromContent(String reference, InputStream is, ContentType contentType) throws IOException
      Description copied from interface: ICanonicalLinkDetector
      Detects the presence of a canonical URL in a document's content. This occurs before the document gets parsed and may apply to only a few content types.
      Specified by:
      detectFromContent in interface ICanonicalLinkDetector
      Parameters:
      reference - document reference
      is - the document content input stream
      contentType - the document content type
      Returns:
      the detected canonical URL or null if none is found.
      Throws:
      IOException - problem reading content
    • loadFromXML

      public void loadFromXML(XML xml)
      Specified by:
      loadFromXML in interface IXMLConfigurable
    • saveToXML

      public void saveToXML(XML xml)
      Specified by:
      saveToXML in interface IXMLConfigurable
    • equals

      public boolean equals(Object other)
      Overrides:
      equals in class Object
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class Object
    • toString

      public String toString()
      Overrides:
      toString in class Object