Interface ICanonicalLinkDetector
- All Known Implementing Classes:
GenericCanonicalLinkDetector
public interface ICanonicalLinkDetector
Detects and returns any canonical URL found in documents, whether from the HTTP headers (metadata) or from the page content (usually HTML). Documents that reference a canonical URL are rejected in favor of the document represented by that canonical URL.
When HttpCrawlerConfig.isFetchHttpHead() is true, a page won't be downloaded if a canonical link is found in the HTTP headers (saving bandwidth and processing). If not used, or if no canonical link was found, an attempt will be made against the HTTP headers obtained (if any) just after fetching a document. If no canonical link was found there, then the content is evaluated. A canonical link found to be the same as the current page reference is ignored.
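The detection order described above can be sketched as follows. This is only an illustration, not the crawler's actual code: the class CanonicalFlowSketch, the nested Detector interface and the resolve method are hypothetical names, and the real Properties and ContentType types are replaced with a plain Map and a String so the sketch compiles on its own.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Map;

final class CanonicalFlowSketch {

    /** Simplified stand-in for ICanonicalLinkDetector (assumed types). */
    interface Detector {
        String detectFromMetadata(
                String reference, Map<String, List<String>> metadata);
        String detectFromContent(
                String reference, InputStream is, String contentType)
                throws IOException;
    }

    /**
     * Returns the canonical URL to follow instead of the reference,
     * or null when the document should be processed as is.
     * Any argument other than detector and reference may be null
     * when that stage is unavailable.
     */
    static String resolve(Detector detector, String reference,
            Map<String, List<String>> headHeaders,
            Map<String, List<String>> fetchHeaders,
            InputStream content, String contentType) throws IOException {
        String canonical = null;
        // 1. Headers from the HTTP HEAD request, when one was made.
        if (headHeaders != null) {
            canonical = detector.detectFromMetadata(reference, headHeaders);
        }
        // 2. Headers obtained just after fetching the document.
        if (canonical == null && fetchHeaders != null) {
            canonical = detector.detectFromMetadata(reference, fetchHeaders);
        }
        // 3. The document content itself.
        if (canonical == null && content != null) {
            canonical = detector.detectFromContent(
                    reference, content, contentType);
        }
        // A canonical link pointing to the current page is ignored.
        if (canonical != null && canonical.equals(reference)) {
            return null;
        }
        return canonical;
    }
}
```

In a real crawler the download would be skipped entirely when stage 1 succeeds; here all inputs are passed in up front to keep the example short. The sketch also stops at the first link found before checking for a self-reference, which is one possible reading of the rule above.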
- Since:
- 2.2.0
- Author:
- Pascal Essiembre
Method Summary
Modifier and Type | Method | Description
String | detectFromContent(String reference, InputStream is, ContentType contentType) | Detects the presence of a canonical URL in a document's content.
String | detectFromMetadata(String reference, Properties metadata) | Detects a canonical URL from the metadata gathered so far, which at the time of invocation is normally the HTTP header values.
Method Detail
-
detectFromMetadata
String detectFromMetadata(String reference, Properties metadata)
Detects a canonical URL from the metadata gathered so far, which at the time of invocation is normally the HTTP header values.
- Parameters:
reference - document reference
metadata - metadata object containing HTTP headers
- Returns:
- the detected canonical URL, or null if none is found.
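As an illustration of what a header-based implementation might look like, the sketch below scans an HTTP Link header for a rel="canonical" entry. It is a minimal example under stated assumptions, not the logic of GenericCanonicalLinkDetector: the class name HeaderCanonicalSketch is hypothetical, a Map of header names to values stands in for the Properties type, and resolving relative links against the document reference is an assumed behavior.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeaderCanonicalSketch {

    // Matches one Link header entry of the form <url>; rel="canonical".
    // [^,]*? keeps the match within a single comma-separated entry.
    private static final Pattern CANONICAL_LINK = Pattern.compile(
            "<([^>]+)>\\s*;[^,]*?\\brel\\s*=\\s*\"?canonical\"?",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the canonical URL found in a Link header,
     * or null if none is found.
     * The Map parameter is an assumed stand-in for Properties.
     */
    static String detectFromMetadata(
            String reference, Map<String, List<String>> metadata) {
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            // HTTP header names are case-insensitive.
            if (!"Link".equalsIgnoreCase(entry.getKey())) {
                continue;
            }
            for (String value : entry.getValue()) {
                Matcher m = CANONICAL_LINK.matcher(value);
                if (m.find()) {
                    // Assumption: relative links are resolved against
                    // the document reference.
                    return URI.create(reference)
                            .resolve(m.group(1).trim()).toString();
                }
            }
        }
        return null;
    }
}
```

The sketch leaves the comparison with the current page reference to the caller, and it lets URI.create throw IllegalArgumentException for a malformed reference rather than handling it.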
-
detectFromContent
String detectFromContent(String reference, InputStream is, ContentType contentType) throws IOException
Detects the presence of a canonical URL in a document's content. This occurs before a document gets parsed and may apply to only a few content types.
- Parameters:
reference - document reference
is - the document content input stream
contentType - the document content type
- Returns:
- the detected canonical URL, or null if none is found.
- Throws:
IOException - problem reading content
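A content-based detector could, for example, look for an HTML link element with rel="canonical". The sketch below is an assumption-laden illustration rather than the library's implementation: the class name ContentCanonicalSketch is hypothetical, the content type is passed as a plain String instead of ContentType, the content is assumed to be UTF-8, only HTML-like content types are handled, and a regular expression is used in place of a real HTML parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentCanonicalSketch {

    private static final Pattern LINK_TAG = Pattern.compile(
            "<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_CANONICAL = Pattern.compile(
            "\\brel\\s*=\\s*[\"']?canonical\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the canonical URL declared in HTML content,
     * or null if none is found or the content type is not handled.
     */
    static String detectFromContent(
            String reference, InputStream is, String contentType)
            throws IOException {
        // Assumption: only HTML and XHTML are worth scanning.
        if (contentType == null) {
            return null;
        }
        String type = contentType.toLowerCase(Locale.ROOT);
        if (!type.startsWith("text/html")
                && !type.startsWith("application/xhtml+xml")) {
            return null;
        }
        // Assumption: UTF-8 content, small enough to read into memory.
        String html = new String(is.readAllBytes(), StandardCharsets.UTF_8);
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            String element = tag.group();
            if (!REL_CANONICAL.matcher(element).find()) {
                continue;
            }
            Matcher href = HREF.matcher(element);
            if (href.find()) {
                // Resolve relative links against the document reference.
                return URI.create(reference)
                        .resolve(href.group(1).trim()).toString();
            }
        }
        return null;
    }
}
```

Reading the whole stream is acceptable for a sketch; a production detector would more likely stop once the end of the head section is reached, since that is where a canonical link is normally declared.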