Class StandardRobotsMetaProvider
java.lang.Object
com.norconex.collector.http.robot.impl.StandardRobotsMetaProvider
- All Implemented Interfaces:
IRobotsMetaProvider, IXMLConfigurable
public class StandardRobotsMetaProvider
extends Object
implements IRobotsMetaProvider, IXMLConfigurable
Implementation of IRobotsMetaProvider as per the X-Robots-Tag
and ROBOTS standards.
Extracts robots information from the "ROBOTS" meta tag in an HTML page
or from the "X-Robots-Tag" HTTP response header (see
https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
and
http://www.robotstxt.org/meta.html).
If you specified a prefix for the HTTP headers, make sure to specify it again here, or the robots meta tags will not be found.
If robots instructions are provided in both the HTML page and the HTTP header, the ones in the HTML page take precedence and the ones in the HTTP header are ignored.
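The precedence and prefix rules above can be sketched in plain Java. This is an illustration only, not the class's actual implementation: the class name RobotsMetaSketch, its methods, the simplified regular expression, and the "collector." prefix are all hypothetical, and the sketch assumes the prefix is simply prepended to the header name.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of the Norconex API.
final class RobotsMetaSketch {

    // Simplified: only matches <meta name="robots" content="..."> with
    // the name attribute placed before the content attribute.
    private static final Pattern META_ROBOTS = Pattern.compile(
            "<meta\\s+name\\s*=\\s*[\"']robots[\"']\\s+"
            + "content\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the directives from the HTML meta tag, or null if absent. */
    static String fromHtml(String html) {
        Matcher m = META_ROBOTS.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the (optionally prefixed) X-Robots-Tag value, or null. */
    static String fromHeaders(Map<String, String> headers, String prefix) {
        // Assumption: stored header names are "<prefix>X-Robots-Tag".
        String key = (prefix == null ? "" : prefix) + "X-Robots-Tag";
        for (Map.Entry<String, String> e : headers.entrySet()) {
            if (e.getKey().equalsIgnoreCase(key)) {
                return e.getValue();
            }
        }
        return null;
    }

    /** HTML directives win; the header is consulted only as a fallback. */
    static String resolve(String html, Map<String, String> headers,
            String prefix) {
        String directives = fromHtml(html);
        return directives != null ? directives : fromHeaders(headers, prefix);
    }

    public static void main(String[] args) {
        String html = "<meta name=\"robots\" content=\"noindex\">";
        Map<String, String> headers =
                Map.of("collector.X-Robots-Tag", "nofollow");
        // Both sources present: the HTML value is used.
        System.out.println(resolve(html, headers, "collector."));
        // No meta tag and no prefix given: the prefixed header is missed.
        System.out.println(resolve("<html></html>", headers, null));
    }
}
```

The second call in main shows why the prefix matters: when headers were stored under a prefix and the provider is not told about it, the lookup finds nothing.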
XML configuration usage:
<robotsMeta
    ignore="false"
    class="com.norconex.collector.http.robot.impl.StandardRobotsMetaProvider">
  <headersPrefix>(string prefixing headers)</headersPrefix>
</robotsMeta>
XML usage example:
<robotsMeta
ignore="true"/>
The above example ignores robot meta information.
- Author:
- Pascal Essiembre
-
Constructor Summary
Constructors
StandardRobotsMetaProvider()
-
Method Summary
Modifier and Type    Method    Description
boolean       equals(Object)
String        getHeadersPrefix()
RobotsMeta    getRobotsMeta(Reader document, String documentUrl, ContentType contentType, Properties httpHeaders)    Extracts Robots meta information for a page, if any.
int           hashCode()
void          loadFromXML(XML xml)
void          saveToXML(XML xml)
void          setHeadersPrefix(String headersPrefix)
String        toString()
-
Constructor Details
-
StandardRobotsMetaProvider
public StandardRobotsMetaProvider()
-
-
Method Details
-
getRobotsMeta
public RobotsMeta getRobotsMeta(Reader document, String documentUrl, ContentType contentType, Properties httpHeaders) throws IOException
Description copied from interface: IRobotsMetaProvider
Extracts Robots meta information for a page, if any.
- Specified by:
getRobotsMeta in interface IRobotsMetaProvider
- Parameters:
document - the document
documentUrl - document url
contentType - the document content type
httpHeaders - the document HTTP Headers
- Returns:
- robots meta instance
- Throws:
IOException - problem reading the document
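As a rough illustration of what a robots meta instance might carry, the sketch below turns a directive string such as "noindex, nofollow" into two flags. The class RobotsDirectivesSketch and its fields are hypothetical and are not the library's RobotsMeta type. Only the noindex, nofollow, and none directives are handled, and user-agent-scoped values in X-Robots-Tag are not considered.

```java
import java.util.Locale;

// Hypothetical sketch; not the library's RobotsMeta class.
final class RobotsDirectivesSketch {
    final boolean noindex;
    final boolean nofollow;

    private RobotsDirectivesSketch(boolean noindex, boolean nofollow) {
        this.noindex = noindex;
        this.nofollow = nofollow;
    }

    /** Parses a comma-separated directive list, ignoring case and spaces. */
    static RobotsDirectivesSketch parse(String content) {
        boolean noindex = false;
        boolean nofollow = false;
        if (content != null) {
            String lower = content.toLowerCase(Locale.ROOT);
            for (String token : lower.split(",")) {
                switch (token.trim()) {
                    case "noindex":
                        noindex = true;
                        break;
                    case "nofollow":
                        nofollow = true;
                        break;
                    case "none": // shorthand for "noindex, nofollow"
                        noindex = true;
                        nofollow = true;
                        break;
                    default:
                        break; // unknown directives are ignored
                }
            }
        }
        return new RobotsDirectivesSketch(noindex, nofollow);
    }

    public static void main(String[] args) {
        RobotsDirectivesSketch d = parse("NOINDEX, follow");
        System.out.println("noindex=" + d.noindex
                + ", nofollow=" + d.nofollow);
    }
}
```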
-
getHeadersPrefix
-
setHeadersPrefix
-
loadFromXML
- Specified by:
loadFromXML in interface IXMLConfigurable
-
saveToXML
- Specified by:
saveToXML in interface IXMLConfigurable
-
equals
-
hashCode
public int hashCode()
-
toString
-