ICanonicalLinkDetector (Norconex HTTP Collector 3.1.0 API)

All Known Implementing Classes:

GenericCanonicalLinkDetector
```
public interface ICanonicalLinkDetector
```
Detects and returns any canonical URL found in documents, whether from the HTTP headers (metadata) or from the page content (usually HTML). Documents that reference a canonical URL are rejected in favor of the document represented by that canonical URL.

When HttpCrawlerConfig.isFetchHttpHead() is true, a page is not downloaded if a canonical link is found in the HTTP headers (saving bandwidth and processing). If HTTP HEAD fetching is not enabled, or if no canonical link was found that way, detection is attempted against the HTTP headers obtained (if any) just after fetching a document. If no canonical link is found there either, the content is evaluated.

A canonical link found to be the same as the current page reference is ignored.
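
The lookup order described above can be sketched as follows. This is only an illustration and not the crawler's actual code: the class `CanonicalLookupOrder` and the method `firstCanonical` are hypothetical names, each source is modeled as a lazy `Supplier` so that later (more expensive) steps run only when earlier ones find nothing, and plain string equality is assumed for the "same as the current page reference" check. Whether the crawler keeps looking after a self-referencing link is also an assumption of this sketch.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper, not part of the Norconex API.
final class CanonicalLookupOrder {

    /**
     * Tries each source in order (for example: HEAD headers, GET headers,
     * content) and returns the first canonical URL that is non-null and
     * differs from the current reference.
     */
    @SafeVarargs
    static Optional<String> firstCanonical(
            String reference, Supplier<String>... sources) {
        for (Supplier<String> source : sources) {
            // Each supplier is evaluated only when reached, so a hit in the
            // headers avoids the cost of downloading and scanning the content.
            String canonical = source.get();
            // Assumption: a link equal to the reference is ignored and the
            // next source is tried.
            if (canonical != null && !canonical.equals(reference)) {
                return Optional.of(canonical);
            }
        }
        return Optional.empty();
    }
}
```

A caller would pass one supplier per detection step, in the order given above; an empty result means the document is processed normally.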

Since:

2.2.0

Author:

Pascal Essiembre

Method Summary

Modifier and Type	Method	Description
`String`	`detectFromContent(String reference, InputStream is, ContentType contentType)`	Detects the presence of a canonical URL in a document's content.
`String`	`detectFromMetadata(String reference, Properties metadata)`	Detects a canonical URL from the metadata gathered so far, which at invocation time normally holds the HTTP header values.

- Method Detail
  - detectFromMetadata
```
String detectFromMetadata(String reference,
                          Properties metadata)
```
    Detects a canonical URL from the metadata gathered so far, which at invocation time normally holds the HTTP header values.
    
    Parameters:
    
    reference - document reference
    
    metadata - metadata object containing HTTP headers
    
    Returns:
    
    the detected canonical URL or null if none is found.
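    
    As a rough illustration of header-based detection, the sketch below looks for an HTTP `Link` header carrying `rel="canonical"`. It is not the library's implementation: the class name `LinkHeaderCanonicalSketch` is hypothetical, and a plain `Map<String, List<String>>` stands in for the `Properties` metadata type so the example compiles with the JDK alone. The `reference` parameter is kept only to mirror the signature; a real detector would likely use it to resolve relative URLs.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, not the GenericCanonicalLinkDetector code.
final class LinkHeaderCanonicalSketch {

    // Matches one link-value such as: <http://host/page>; rel="canonical"
    // Parameters are assumed not to contain commas, which keeps the match
    // inside a single entry of a comma-separated Link header.
    private static final Pattern CANONICAL_LINK = Pattern.compile(
            "<([^>]+)>\\s*;[^,]*?\\brel\\s*=\\s*\"?[^\",;]*\\bcanonical\\b",
            Pattern.CASE_INSENSITIVE);

    /** Returns the canonical URL found in a Link header, or null if none. */
    static String detectFromMetadata(
            String reference, Map<String, List<String>> headers) {
        for (Map.Entry<String, List<String>> header : headers.entrySet()) {
            // HTTP header names are case-insensitive.
            if (!"Link".equalsIgnoreCase(header.getKey())) {
                continue;
            }
            for (String value : header.getValue()) {
                Matcher m = CANONICAL_LINK.matcher(value);
                if (m.find()) {
                    return m.group(1).trim();
                }
            }
        }
        return null;
    }
}
```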
  - detectFromContent
```
String detectFromContent(String reference,
                         InputStream is,
                         ContentType contentType)
                  throws IOException
```
    Detects the presence of a canonical URL in a document's content. This occurs before a document gets parsed and may apply to only a few content types.
    
    Parameters:
    
    reference - document reference
    
    is - the document content input stream
    
    contentType - the document content type
    
    Returns:
    
    the detected canonical URL or null if none is found.
    
    Throws:
    
    IOException - problem reading content
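    
    A content-based detector could, for instance, scan the HTML head for a `<link rel="canonical" href="...">` tag. The sketch below is a simplified illustration under several assumptions: the class name `HtmlCanonicalSketch` is hypothetical, a `String` stands in for the `ContentType` parameter, only `text/html` is handled, the content is decoded as UTF-8, and the whole stream is read where a real detector would probably stop at the end of the head section.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, not the GenericCanonicalLinkDetector code.
final class HtmlCanonicalSketch {

    private static final Pattern HEAD_END =
            Pattern.compile("</head\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_CANONICAL = Pattern.compile(
            "\\brel\\s*=\\s*[\"']?canonical\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the canonical URL declared in the HTML head, or null. */
    static String detectFromContent(
            String reference, InputStream is, String contentType)
            throws IOException {
        // Only a few content types can carry a canonical link; this sketch
        // supports HTML only.
        if (contentType == null
                || !contentType.regionMatches(true, 0, "text/html", 0, 9)) {
            return null;
        }
        // Assumption: UTF-8 content, read entirely into memory.
        String html = new String(is.readAllBytes(), StandardCharsets.UTF_8);
        Matcher headEnd = HEAD_END.matcher(html);
        String head = headEnd.find() ? html.substring(0, headEnd.start()) : html;

        Matcher tag = LINK_TAG.matcher(head);
        while (tag.find()) {
            String attrs = tag.group();
            // Check rel and href separately so attribute order is irrelevant.
            if (REL_CANONICAL.matcher(attrs).find()) {
                Matcher href = HREF.matcher(attrs);
                if (href.find()) {
                    return href.group(1).trim();
                }
            }
        }
        return null;
    }
}
```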