Interface ICanonicalLinkDetector

All Known Implementing Classes:
GenericCanonicalLinkDetector

public interface ICanonicalLinkDetector

Detects and returns any canonical URL found in a document, whether in the HTTP headers (metadata) or in the page content (usually HTML). Documents that reference a canonical URL are rejected in favor of the document represented by that canonical URL.

When HttpCrawlerConfig.isFetchHttpHead() is true, a page is not downloaded if a canonical link is found in the HTTP headers (saving bandwidth and processing). If that option is not enabled, or if no canonical link was found, detection is attempted against the HTTP headers obtained (if any) just after fetching the document. If no canonical link is found there, the content is evaluated.

A canonical link found to be the same as the current page reference is ignored.
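
The detection order described above can be sketched as follows. This is only an illustration, not the crawler's actual code: the class CanonicalOrderSketch, the resolve method, and the use of Supplier to stand in for each detection stage (HEAD headers, GET headers, content) are assumptions made for this example. It also assumes that a canonical link equal to the current reference is treated as if nothing was found at that stage.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of the library.
class CanonicalOrderSketch {

    /**
     * Runs each detection stage in order and returns the first canonical URL
     * that differs from the current reference, or null if no stage yields one.
     * Stages are suppliers so that later, more expensive stages (such as
     * downloading and scanning content) only run when earlier ones find nothing.
     */
    @SafeVarargs
    public static String resolve(String reference, Supplier<String>... stages) {
        for (Supplier<String> stage : stages) {
            String canonical = stage.get();
            // Assumption: a self-referencing canonical link is ignored.
            if (canonical != null && !canonical.equals(reference)) {
                return canonical;
            }
        }
        return null;
    }
}
```

Because the stages are evaluated lazily, a canonical URL found in the first stage means the remaining stages are never invoked, which mirrors the bandwidth saving mentioned above.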

Since:
2.2.0
Author:
Pascal Essiembre
  • Method Details

    • detectFromMetadata

      String detectFromMetadata(String reference, Properties metadata)
      Detects a canonical URL from the metadata gathered so far. At the time this method is invoked, the metadata normally consists of the HTTP header values.
      Parameters:
      reference - document reference
      metadata - metadata object containing HTTP headers
      Returns:
      the detected canonical URL or null if none is found.
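
One common place for a canonical URL in HTTP headers is a Link header carrying rel="canonical". The following is a minimal, standalone sketch of extracting it. It does not use the library's Properties class: the headers are modeled as a plain Map, and the class and method names are hypothetical. How an actual implementation of this interface reads the headers may differ.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the library.
class LinkHeaderCanonicalSketch {

    // One link-value: "<target>" followed by its parameters, up to the next "<".
    private static final Pattern LINK = Pattern.compile("<([^>]*)>([^<]*)");
    // The rel parameter, either quoted or unquoted.
    private static final Pattern REL = Pattern.compile(
            "(?i);\\s*rel\\s*=\\s*(?:\"([^\"]*)\"|([^\\s;,\"]+))");

    /** Returns the canonical URL from any "Link" header, or null if none. */
    public static String detectFromHeaders(Map<String, List<String>> headers) {
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            // Header names are case-insensitive.
            if (!"Link".equalsIgnoreCase(e.getKey())) {
                continue;
            }
            for (String value : e.getValue()) {
                Matcher link = LINK.matcher(value);
                while (link.find()) {
                    if (hasCanonicalRel(link.group(2))) {
                        return link.group(1).trim();
                    }
                }
            }
        }
        return null;
    }

    // True if the rel parameter lists "canonical" among its
    // space-separated relation types.
    private static boolean hasCanonicalRel(String params) {
        Matcher rel = REL.matcher(params);
        if (!rel.find()) {
            return false;
        }
        String relValue = rel.group(1) != null ? rel.group(1) : rel.group(2);
        for (String token : relValue.trim().split("\\s+")) {
            if ("canonical".equalsIgnoreCase(token)) {
                return true;
            }
        }
        return false;
    }
}
```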
    • detectFromContent

      String detectFromContent(String reference, InputStream is, ContentType contentType) throws IOException
      Detects the presence of a canonical URL in a document's content. This occurs before the document gets parsed and may apply to only a few content types.
      Parameters:
      reference - document reference
      is - the document content input stream
      contentType - the document content type
      Returns:
      the detected canonical URL or null if none is found.
      Throws:
      IOException - problem reading content
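
For HTML content, a canonical URL is typically declared with a link tag whose rel attribute is "canonical". The sketch below shows one rough way to find it before any real parsing takes place. It is an illustration under stated assumptions: the class and method names are hypothetical, the content type is passed as a plain String rather than the library's ContentType, the stream is assumed to be UTF-8, and a regular expression stands in for proper HTML handling.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the library.
class HtmlCanonicalSketch {

    // Captures the attribute section of each <link ...> tag.
    private static final Pattern LINK_TAG =
            Pattern.compile("(?is)<link\\b([^>]*)>");
    private static final Pattern REL_CANONICAL =
            Pattern.compile("(?i)\\brel\\s*=\\s*[\"']?canonical\\b");
    private static final Pattern HREF =
            Pattern.compile("(?i)\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    /** Returns the canonical URL declared in HTML content, or null if none. */
    public static String detectFromContent(InputStream is, String contentType)
            throws IOException {
        // Only HTML is handled here; other content types yield no result.
        if (contentType == null
                || !contentType.toLowerCase().startsWith("text/html")) {
            return null;
        }
        // Assumption: content is UTF-8 and small enough to read in full.
        String html = new String(is.readAllBytes(), StandardCharsets.UTF_8);
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            String attrs = tag.group(1);
            if (REL_CANONICAL.matcher(attrs).find()) {
                Matcher href = HREF.matcher(attrs);
                if (href.find()) {
                    return href.group(1) != null ? href.group(1) : href.group(2);
                }
            }
        }
        return null;
    }
}
```

Reading the whole stream keeps the example short; since canonical links belong in the document head, a real implementation would more likely stop scanning early.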