Class GenericCanonicalLinkDetector

  • All Implemented Interfaces:
    ICanonicalLinkDetector, IXMLConfigurable

    public class GenericCanonicalLinkDetector
    extends Object
    implements ICanonicalLinkDetector, IXMLConfigurable

    Generic canonical link detector. It detects canonical links in HTTP headers as well as in HTML documents. Good reference documentation on canonical links can be found on the Google Webmaster Tools help pages.

    HTTP Headers

    This detector looks for a metadata field named "Link" (normally obtained from the HTTP headers) whose value follows this pattern:

     <http://www.example.com/sample.pdf> rel="canonical"
     

    All documents are checked for a canonical link in this way, not just HTML documents.
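
The header matching described above can be illustrated with a small, standalone sketch. It is not the detector's actual implementation: the class name LinkHeaderSketch, the method name and the regular expression are hypothetical and written only for this example. It assumes the "Link" value holds a URL in angle brackets followed by rel="canonical", with an optional semicolon between the two, and it does not handle other parameters placed between the URL and rel.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the library.
final class LinkHeaderSketch {

    // Assumed pattern: <URL>, an optional ';', then rel="canonical"
    // (quotes optional, matching is case-insensitive).
    private static final Pattern CANONICAL = Pattern.compile(
            "<\\s*([^>]+?)\\s*>\\s*;?\\s*rel\\s*=\\s*\"?canonical\"?",
            Pattern.CASE_INSENSITIVE);

    // Returns the canonical URL found in a "Link" value, or null if none.
    static String canonicalFromLinkHeader(String linkValue) {
        if (linkValue == null) {
            return null;
        }
        Matcher m = CANONICAL.matcher(linkValue);
        return m.find() ? m.group(1) : null;
    }
}
```

Calling LinkHeaderSketch.canonicalFromLinkHeader with the sample value shown above yields the URL between the angle brackets, and a value whose rel is not "canonical" yields null.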

    Document content

    This detector will look within the HTML <head> tags for a <link> tag following this pattern:

     <link rel="canonical" href="https://www.example.com/sample" />
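
As a rough illustration of the tag lookup described above, the sketch below scans the <head> section of an HTML string for a <link> tag whose rel is "canonical" and returns its href. It is only a sketch under assumptions: the class LinkTagSketch, its method and its regular expressions are hypothetical, and it works on an in-memory String, whereas the detector itself receives an InputStream.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the library.
final class LinkTagSketch {

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // Content between <head ...> and </head>.
    private static final Pattern HEAD =
            Pattern.compile("<head\\b[^>]*>(.*?)</head\\s*>", FLAGS);
    // Any single <link ...> tag.
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", FLAGS);
    // rel="canonical" with optional single or double quotes.
    private static final Pattern REL_CANONICAL =
            Pattern.compile("\\brel\\s*=\\s*[\"']?canonical[\"']?", FLAGS);
    // Quoted href attribute value.
    private static final Pattern HREF =
            Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", FLAGS);

    // Returns the canonical URL declared in <head>, or null if none.
    static String canonicalFromHtml(String html) {
        if (html == null) {
            return null;
        }
        Matcher head = HEAD.matcher(html);
        if (!head.find()) {
            return null;
        }
        Matcher link = LINK_TAG.matcher(head.group(1));
        while (link.find()) {
            String tag = link.group();
            if (REL_CANONICAL.matcher(tag).find()) {
                Matcher href = HREF.matcher(tag);
                if (href.find()) {
                    return href.group(1);
                }
            }
        }
        return null;
    }
}
```

Regular expressions keep the example short. A production crawler would more likely rely on a streaming or tolerant HTML parser, since real pages often contain markup these patterns do not anticipate.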
     

    Only HTML documents are checked for a canonical link in their content. By default, these content types are considered HTML:

     text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
     

    You can specify your own content types as long as they contain HTML text.
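
The content type restriction can be pictured with the following sketch, which decides whether a document's content should be inspected at all. It uses plain strings rather than the library's ContentType class so that it stays self-contained. The class HtmlContentTypeSketch and its method are hypothetical, and stripping parameters such as "; charset=UTF-8" is an assumption made for this example, not documented behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the library.
final class HtmlContentTypeSketch {

    // Default content types, copied from the list above.
    static final List<String> DEFAULT_TYPES = Arrays.asList(
            "text/html", "application/xhtml+xml",
            "vnd.wap.xhtml+xml", "x-asp");

    // Returns true if content of the given type should be inspected.
    // A null or empty custom list falls back to the defaults.
    // Custom types are expected in lower case.
    static boolean isCandidate(String contentType, List<String> customTypes) {
        if (contentType == null) {
            return false;
        }
        // Assumption: drop parameters such as "; charset=UTF-8".
        String base = contentType.split(";", 2)[0]
                .trim().toLowerCase(Locale.ROOT);
        List<String> types = (customTypes == null || customTypes.isEmpty())
                ? DEFAULT_TYPES : customTypes;
        return types.contains(base);
    }
}
```

With the default list, "text/html; charset=UTF-8" qualifies and "application/pdf" does not. Passing a custom list replaces the defaults rather than extending them, which mirrors the "leave blank to use defaults" wording of the XML configuration.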

    XML configuration usage:

    
    <canonicalLinkDetector
        class="com.norconex.collector.http.canon.impl.GenericCanonicalLinkDetector"
        ignore="(false|true)">
      <contentTypes>
        (CSV list of content types on which to perform canonical link
        detection. Leave blank or remove this tag to use defaults.)
      </contentTypes>
    </canonicalLinkDetector>

    XML usage example:

    
    <canonicalLinkDetector
        ignore="true"/>

    The above example configures the detector to ignore canonical links.

    Since:
    2.2.0
    Author:
    Pascal Essiembre
    • Constructor Detail

      • GenericCanonicalLinkDetector

        public GenericCanonicalLinkDetector()
    • Method Detail

      • setContentTypes

        public void setContentTypes(ContentType... contentTypes)
        Sets the content types on which to perform canonical link detection.
        Parameters:
        contentTypes - content types
      • setContentTypes

        public void setContentTypes(List<ContentType> contentTypes)
        Sets the content types on which to perform canonical link detection.
        Parameters:
        contentTypes - content types
        Since:
        3.0.0
      • detectFromMetadata

        public String detectFromMetadata(String reference,
                                         Properties metadata)
        Description copied from interface: ICanonicalLinkDetector
        Detects a canonical URL from the metadata gathered so far, which at the time of invocation is normally the HTTP header values.
        Specified by:
        detectFromMetadata in interface ICanonicalLinkDetector
        Parameters:
        reference - document reference
        metadata - metadata object containing HTTP headers
        Returns:
        the detected canonical URL or null if none is found.
      • detectFromContent

        public String detectFromContent(String reference,
                                        InputStream is,
                                        ContentType contentType)
                                 throws IOException
        Description copied from interface: ICanonicalLinkDetector
        Detects the presence of a canonical URL in a document's content. This occurs before the document gets parsed and may apply to only a few content types.
        Specified by:
        detectFromContent in interface ICanonicalLinkDetector
        Parameters:
        reference - document reference
        is - the document content input stream
        contentType - the document content type
        Returns:
        the detected canonical URL or null if none is found.
        Throws:
        IOException - problem reading content
      • hashCode

        public int hashCode()
        Overrides:
        hashCode in class Object