Interface ICanonicalLinkDetector

  • All Known Implementing Classes:
    GenericCanonicalLinkDetector

    public interface ICanonicalLinkDetector

    Detects and returns any canonical URL found in a document, whether in the HTTP headers (metadata) or in the page content (usually HTML). Documents that reference a canonical URL are rejected in favor of the document represented by that canonical URL.

    When HttpCrawlerConfig.isFetchHttpHead() is true, a page is not downloaded if a canonical link is found in the HTTP headers, which saves bandwidth and processing. If that option is not used, or if no canonical link was found, detection is attempted against the HTTP headers obtained (if any) just after fetching the document. If no canonical link is found there either, the content is evaluated.

    A canonical link found to be the same as the current page reference is ignored.
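
    The detection order described above can be sketched as follows. This is a minimal illustration, not the crawler's actual code: the class CanonicalFlowSketch, its methods, and the three Supplier arguments are hypothetical stand-ins for the HEAD-header, GET-header, and content checks. It also assumes that a self-referencing canonical link is treated the same as "none found", so later stages are still consulted.

```java
import java.util.function.Supplier;

class CanonicalFlowSketch {

    /** Returns null when the canonical URL is missing or equals the reference (assumption). */
    public static String usable(String reference, String canonical) {
        if (canonical == null || canonical.equals(reference)) {
            return null;
        }
        return canonical;
    }

    /**
     * Walks the stages in order: HEAD headers (only when enabled),
     * then headers obtained with the document, then the content.
     */
    public static String resolve(String reference, boolean fetchHttpHead,
            Supplier<String> headHeaders,
            Supplier<String> getHeaders,
            Supplier<String> content) {
        if (fetchHttpHead) {
            String fromHead = usable(reference, headHeaders.get());
            if (fromHead != null) {
                return fromHead; // the page body would not be downloaded
            }
        }
        String fromGet = usable(reference, getHeaders.get());
        if (fromGet != null) {
            return fromGet;
        }
        return usable(reference, content.get());
    }

    public static void main(String[] args) {
        String ref = "https://example.com/page?utm=1";
        System.out.println(resolve(ref, true,
                () -> "https://example.com/page", () -> null, () -> null));
    }
}
```

    Passing the stages as suppliers keeps the later, more expensive checks from running once an earlier stage has produced a usable canonical URL.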

    Since:
    2.2.0
    Author:
    Pascal Essiembre
    • Method Detail

      • detectFromMetadata

        String detectFromMetadata(String reference,
                                  Properties metadata)
        Detects a canonical URL from the metadata gathered so far which, at the time this method is invoked, normally consists of the HTTP header values.
        Parameters:
        reference - document reference
        metadata - metadata object containing HTTP headers
        Returns:
        the detected canonical URL or null if none is found.
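
        As an illustration of header-based detection, the sketch below looks for a Link header carrying rel="canonical". It is a simplified stand-in, not the library's implementation: the class LinkHeaderSketch is hypothetical, a plain Map of header names to values replaces the Properties argument, and the regular expression only covers the common single-valued rel form.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkHeaderSketch {

    // Matches one link-value such as: <https://example.com/a>; rel="canonical"
    // Simplification: rel must hold the single value "canonical".
    private static final Pattern CANONICAL = Pattern.compile(
            "<([^>]+)>\\s*;[^,]*\\brel\\s*=\\s*\"?canonical\"?\\s*(?:[;,]|$)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the canonical URL found in a Link header, or null if none is found. */
    public static String detectFromMetadata(
            String reference, Map<String, List<String>> headers) {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            // HTTP header names are case-insensitive.
            if (!"Link".equalsIgnoreCase(entry.getKey())) {
                continue;
            }
            for (String value : entry.getValue()) {
                Matcher m = CANONICAL.matcher(value);
                if (m.find()) {
                    return m.group(1).trim();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers = Map.of(
                "Link", List.of("<https://example.com/a>; rel=\"canonical\""));
        System.out.println(
                detectFromMetadata("https://example.com/a?x=1", headers));
    }
}
```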
      • detectFromContent

        String detectFromContent(String reference,
                                 InputStream is,
                                 ContentType contentType)
                          throws IOException
        Detects the presence of a canonical URL in a document's content. This occurs before the document is parsed and may apply to only a few content types.
        Parameters:
        reference - document reference
        is - the document content input stream
        contentType - the document content type
        Returns:
        the detected canonical URL or null if none is found.
        Throws:
        IOException - problem reading content
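
        A content-based detector might scan the HTML head for a link element with rel="canonical", as in the sketch below. The class ContentCanonicalSketch and its helper findCanonical are hypothetical. The sketch also makes several assumptions: the content type is passed as a plain String rather than a ContentType object, only text/html is handled, the content is decoded as UTF-8, and only the first 64 KB are read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentCanonicalSketch {

    private static final int MAX_BYTES = 64 * 1024; // arbitrary cap for this sketch
    private static final Pattern HEAD_END =
            Pattern.compile("</head\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_CANONICAL = Pattern.compile(
            "\\brel\\s*=\\s*[\"']?canonical[\"']?(?=[\\s/>])",
            Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Reads the start of the stream and looks for a canonical link in the head. */
    public static String detectFromContent(
            String reference, InputStream is, String contentType)
            throws IOException {
        if (contentType == null || !contentType
                .toLowerCase(Locale.ROOT).startsWith("text/html")) {
            return null; // only HTML is handled in this sketch
        }
        byte[] start = is.readNBytes(MAX_BYTES); // Java 11+
        return findCanonical(new String(start, StandardCharsets.UTF_8));
    }

    /** Returns the href of the first canonical link tag before the head end tag, or null. */
    public static String findCanonical(String html) {
        Matcher headEnd = HEAD_END.matcher(html);
        String head = headEnd.find() ? html.substring(0, headEnd.start()) : html;
        Matcher tags = LINK_TAG.matcher(head);
        while (tags.find()) {
            String tag = tags.group();
            if (REL_CANONICAL.matcher(tag).find()) {
                Matcher href = HREF.matcher(tag);
                if (href.find()) {
                    return href.group(1);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><link rel=\"canonical\" "
                + "href=\"https://example.com/a\"></head><body></body></html>";
        InputStream is = new ByteArrayInputStream(
                html.getBytes(StandardCharsets.UTF_8));
        System.out.println(detectFromContent(
                "https://example.com/a?utm=1", is, "text/html; charset=UTF-8"));
    }
}
```

        Regular expressions are used here only to keep the example short. A production detector would more likely rely on a tolerant HTML parser and honor the declared character encoding.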