Class GenericCanonicalLinkDetector
- java.lang.Object
  - com.norconex.collector.http.canon.impl.GenericCanonicalLinkDetector
- All Implemented Interfaces: ICanonicalLinkDetector, IXMLConfigurable
public class GenericCanonicalLinkDetector extends Object implements ICanonicalLinkDetector, IXMLConfigurable
Generic canonical link detector. It detects canonical links from HTTP headers as well as from HTML content. Good reference documentation on canonical links can be found in the Google Webmaster Tools help pages.
HTTP Headers
This detector looks for a metadata field (normally obtained from the HTTP headers) named "Link" with a value following this pattern:
<http://www.example.com/sample.pdf> rel="canonical"
All documents are checked for a canonical link in their metadata, not just HTML documents.
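The header pattern above can be illustrated with a minimal sketch using only the Java standard library. This is not the detector's actual implementation: the class name LinkHeaderSketch and the method canonicalFromLinkHeader are hypothetical, and the regular expression is an assumption about what a reasonable match looks like. It accepts the form shown above as well as the semicolon-separated form commonly sent by servers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of the Norconex API).
class LinkHeaderSketch {

    // Captures the URL between angle brackets when it is followed by
    // rel="canonical". The semicolon and the quotes are optional.
    private static final Pattern CANONICAL_LINK = Pattern.compile(
            "<([^>]+)>\\s*;?\\s*rel\\s*=\\s*[\"']?canonical[\"']?",
            Pattern.CASE_INSENSITIVE);

    // Returns the canonical URL found in a "Link" header value,
    // or null if the value holds no canonical link.
    static String canonicalFromLinkHeader(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        Matcher m = CANONICAL_LINK.matcher(headerValue);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

Calling LinkHeaderSketch.canonicalFromLinkHeader with the sample value above yields the URL between the angle brackets; a value whose only relation is, say, rel="next" yields null.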
Document content
This detector will look within the HTML <head> tags for a <link> tag following this pattern:
<link rel="canonical" href="https://www.example.com/sample" />
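As a rough illustration of the content-based detection described here, the sketch below scans the portion of an HTML string up to the closing head tag for a link tag whose rel attribute is "canonical" and returns its href. It is a simplified, regex-based approximation under stated assumptions, not the detector's actual parsing logic; the class name HtmlCanonicalSketch and the method canonicalFromHtml are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of the Norconex API).
class HtmlCanonicalSketch {

    private static final Pattern HEAD_END =
            Pattern.compile("</head\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_CANONICAL = Pattern.compile(
            "\\brel\\s*=\\s*[\"']canonical[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the href of the first canonical <link> tag found before
    // </head>, or null if there is none.
    static String canonicalFromHtml(String html) {
        if (html == null) {
            return null;
        }
        // Only consider the text up to the end of the head section.
        Matcher headEnd = HEAD_END.matcher(html);
        String head = headEnd.find() ? html.substring(0, headEnd.start()) : html;

        Matcher tags = LINK_TAG.matcher(head);
        while (tags.find()) {
            String tag = tags.group();
            // Attribute order inside the tag does not matter here.
            if (REL_CANONICAL.matcher(tag).find()) {
                Matcher href = HREF.matcher(tag);
                if (href.find()) {
                    return href.group(1);
                }
            }
        }
        return null;
    }
}
```

A regular expression keeps the sketch short, but it does not handle every valid HTML construct (unquoted attribute values, for example); a real HTML parser would be more robust.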
Only HTML documents will be verified for a canonical link. By default, these content-types are considered HTML:
text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp
You can specify your own content types as long as they contain HTML text.
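The content-type restriction can be pictured with the following sketch, which parses a comma-separated list of content types (falling back to the defaults listed above when the list is blank) and checks whether a document's content type qualifies. The class ContentTypeFilterSketch and its methods are hypothetical and use plain strings rather than the library's ContentType class; how the detector actually compares content types is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of the Norconex API).
class ContentTypeFilterSketch {

    // Default content types considered HTML, as listed in this documentation.
    static final List<String> DEFAULTS = Arrays.asList(
            "text/html", "application/xhtml+xml", "vnd.wap.xhtml+xml", "x-asp");

    // Parses a CSV list of content types; blank input means "use defaults".
    static List<String> parseCsv(String csv) {
        if (csv == null || csv.trim().isEmpty()) {
            return DEFAULTS;
        }
        List<String> types = new ArrayList<>();
        for (String part : csv.split(",")) {
            String type = part.trim().toLowerCase(Locale.ROOT);
            if (!type.isEmpty()) {
                types.add(type);
            }
        }
        return types.isEmpty() ? DEFAULTS : types;
    }

    // Checks a content type against the accepted list, ignoring any
    // parameters such as "; charset=UTF-8".
    static boolean accepts(List<String> accepted, String contentType) {
        if (contentType == null) {
            return false;
        }
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return accepted.contains(base);
    }
}
```

With a blank list, "text/html; charset=UTF-8" is accepted because the defaults apply; with a list of only "text/html", "application/pdf" is rejected.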
XML configuration usage:
<canonicalLinkDetector
    class="com.norconex.collector.http.canon.impl.GenericCanonicalLinkDetector"
    ignore="(false|true)">
  <contentTypes>
    (CSV list of content types on which to perform canonical link
    detection. Leave blank or remove this tag to use defaults.)
  </contentTypes>
</canonicalLinkDetector>
XML usage example:
<canonicalLinkDetector ignore="true"/>
The above example ignores canonical link resolution.
- Since: 2.2.0
- Author: Pascal Essiembre
Constructor Summary
- GenericCanonicalLinkDetector()
Method Summary
Modifier and type, method, and description:
- String detectFromContent(String reference, InputStream is, ContentType contentType): Detects the presence of a canonical URL in a document's content.
- String detectFromMetadata(String reference, Properties metadata): Detects a canonical URL from the metadata gathered so far, which at the time of invocation is normally the HTTP header values.
- boolean equals(Object other)
- List<ContentType> getContentTypes()
- int hashCode()
- void loadFromXML(XML xml)
- void saveToXML(XML xml)
- void setContentTypes(ContentType... contentTypes): Sets the content types on which to perform canonical link detection.
- void setContentTypes(List<ContentType> contentTypes): Sets the content types on which to perform canonical link detection.
- String toString()
Method Detail
-
getContentTypes
public List<ContentType> getContentTypes()
-
setContentTypes
public void setContentTypes(ContentType... contentTypes)
Sets the content types on which to perform canonical link detection.
- Parameters:
  contentTypes - content types
-
setContentTypes
public void setContentTypes(List<ContentType> contentTypes)
Sets the content types on which to perform canonical link detection.
- Parameters:
  contentTypes - content types
- Since: 3.0.0
-
detectFromMetadata
public String detectFromMetadata(String reference, Properties metadata)
Description copied from interface: ICanonicalLinkDetector
Detects a canonical URL from the metadata gathered so far, which at the time of invocation is normally the HTTP header values.
- Specified by: detectFromMetadata in interface ICanonicalLinkDetector
- Parameters:
  reference - document reference
  metadata - metadata object containing HTTP headers
- Returns: the detected canonical URL, or null if none is found.
-
detectFromContent
public String detectFromContent(String reference, InputStream is, ContentType contentType) throws IOException
Description copied from interface: ICanonicalLinkDetector
Detects the presence of a canonical URL in a document's content. This occurs before a document gets parsed and may apply to only a few content types.
- Specified by: detectFromContent in interface ICanonicalLinkDetector
- Parameters:
  reference - document reference
  is - the document content input stream
  contentType - the document content type
- Returns: the detected canonical URL, or null if none is found.
- Throws: IOException - problem reading content
-
loadFromXML
public void loadFromXML(XML xml)
- Specified by: loadFromXML in interface IXMLConfigurable
-
saveToXML
public void saveToXML(XML xml)
- Specified by: saveToXML in interface IXMLConfigurable