Class StandardRobotsMetaProvider
- java.lang.Object
  - com.norconex.collector.http.robot.impl.StandardRobotsMetaProvider
- All Implemented Interfaces:
  IRobotsMetaProvider, IXMLConfigurable
public class StandardRobotsMetaProvider extends Object implements IRobotsMetaProvider, IXMLConfigurable
Implementation of IRobotsMetaProvider as per X-Robots-Tag and ROBOTS standards. Extracts robots information from the "ROBOTS" meta tag in an HTML page or the "X-Robots-Tag" HTTP header (see https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag and http://www.robotstxt.org/meta.html).
If you specified a prefix for the HTTP headers, make sure to specify it again here or the robots meta tags will not be found.
If robots instructions are provided in both the HTML page and the HTTP header, those in the HTML page take precedence and those in the HTTP header are ignored.
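The precedence rule described above can be illustrated with a minimal, JDK-only sketch. It is not the library's implementation: the class `RobotsPrecedenceSketch`, the `Directives` type (a hypothetical stand-in for `RobotsMeta`), and the simplified regular expression are all assumptions made for illustration. The pattern only recognizes a meta tag whose `name` attribute precedes `content`, and user-agent-scoped header values are not handled.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RobotsPrecedenceSketch {

    /** Hypothetical stand-in for the library's RobotsMeta result type. */
    static final class Directives {
        final boolean nofollow;
        final boolean noindex;

        Directives(boolean nofollow, boolean noindex) {
            this.nofollow = nofollow;
            this.noindex = noindex;
        }
    }

    // Simplified assumption: the name attribute appears before content.
    private static final Pattern META_ROBOTS = Pattern.compile(
            "<meta\\s+name\\s*=\\s*[\"']robots[\"']\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Parses a comma-separated directive list; "none" implies both flags. */
    static Directives parse(String content) {
        boolean nofollow = false;
        boolean noindex = false;
        for (String token : content.toLowerCase(Locale.ROOT).split(",")) {
            String t = token.trim();
            if (t.equals("nofollow") || t.equals("none")) {
                nofollow = true;
            }
            if (t.equals("noindex") || t.equals("none")) {
                noindex = true;
            }
        }
        return new Directives(nofollow, noindex);
    }

    /** Returns the page's directives if present, else the header's, else null. */
    static Directives resolve(String html, Map<String, String> httpHeaders) {
        Matcher m = META_ROBOTS.matcher(html);
        if (m.find()) {
            // The HTML meta tag wins; the header is not consulted at all.
            return parse(m.group(1));
        }
        String header = httpHeaders.get("X-Robots-Tag");
        return header == null ? null : parse(header);
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-Robots-Tag", "nofollow");
        String html = "<html><head><meta name=\"robots\" content=\"noindex\"></head></html>";
        Directives d = resolve(html, headers);
        System.out.println("noindex=" + d.noindex + ", nofollow=" + d.nofollow);
    }
}
```

In this sketch the page declares `noindex` while the header declares `nofollow`, so `main` prints `noindex=true, nofollow=false`: the header value is ignored once the page supplies its own instructions.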
XML configuration usage:
<robotsMeta ignore="false"
    class="com.norconex.collector.http.robot.impl.StandardRobotsMetaProvider">
  <headersPrefix>(string prefixing headers)</headersPrefix>
</robotsMeta>
XML usage example:
<robotsMeta ignore="true"/>
The above example ignores robot meta information.
- Author:
  Pascal Essiembre
Constructor Summary
Constructor, Description:
- StandardRobotsMetaProvider()
-
Method Summary
Modifier and Type, Method, Description:
- boolean equals(Object other)
- String getHeadersPrefix()
- RobotsMeta getRobotsMeta(Reader document, String documentUrl, ContentType contentType, Properties httpHeaders)
  Extracts Robots meta information for a page, if any.
- int hashCode()
- void loadFromXML(XML xml)
- void saveToXML(XML xml)
- void setHeadersPrefix(String headersPrefix)
- String toString()
-
Method Detail
-
getRobotsMeta
public RobotsMeta getRobotsMeta(Reader document, String documentUrl, ContentType contentType, Properties httpHeaders) throws IOException
Description copied from interface: IRobotsMetaProvider
Extracts Robots meta information for a page, if any.
- Specified by:
  getRobotsMeta in interface IRobotsMetaProvider
- Parameters:
  document - the document
  documentUrl - document url
  contentType - the document content type
  httpHeaders - the document HTTP Headers
- Returns:
  robots meta instance
- Throws:
  IOException - problem reading the document
-
getHeadersPrefix
public String getHeadersPrefix()
-
setHeadersPrefix
public void setHeadersPrefix(String headersPrefix)
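Why the prefix must match can be sketched as follows. This is an illustration only, written under the assumption that a header is stored under the key formed by concatenating the prefix and the header name; the class `HeadersPrefixSketch`, the method `findRobotsHeader`, and the prefix value `myprefix.` are hypothetical and not part of the library.

```java
import java.util.HashMap;
import java.util.Map;

public class HeadersPrefixSketch {

    /**
     * Looks up the X-Robots-Tag value, assuming (for illustration) that
     * stored header keys are the configured prefix followed by the name.
     */
    static String findRobotsHeader(Map<String, String> storedHeaders, String headersPrefix) {
        String prefix = headersPrefix == null ? "" : headersPrefix;
        return storedHeaders.get(prefix + "X-Robots-Tag");
    }

    public static void main(String[] args) {
        // Headers as they might be stored when a (hypothetical) prefix is in use.
        Map<String, String> stored = new HashMap<>();
        stored.put("myprefix.X-Robots-Tag", "noindex");

        // With the same prefix, the header is found.
        System.out.println(findRobotsHeader(stored, "myprefix."));
        // Without the prefix, the lookup misses and the header goes unnoticed.
        System.out.println(findRobotsHeader(stored, null));
    }
}
```

Under that assumption, a provider configured without the prefix looks for the bare header name and finds nothing, which matches the warning in the class description.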
-
loadFromXML
public void loadFromXML(XML xml)
- Specified by:
  loadFromXML in interface IXMLConfigurable
-
saveToXML
public void saveToXML(XML xml)
- Specified by:
  saveToXML in interface IXMLConfigurable
-
-