Class GenericHttpFetcher
- java.lang.Object
-
- com.norconex.collector.http.fetch.AbstractHttpFetcher
-
- com.norconex.collector.http.fetch.impl.GenericHttpFetcher
-
- All Implemented Interfaces:
IHttpFetcher, IEventListener<Event>, IXMLConfigurable, EventListener, Consumer<Event>
public class GenericHttpFetcher extends AbstractHttpFetcher
Default implementation of IHttpFetcher, based on Apache HttpClient.
The "validStatusCodes" and "notFoundStatusCodes" configuration options expect a comma-separated list of HTTP response codes. If a code is added to both, the valid list takes precedence.
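The precedence rule above can be illustrated with a minimal sketch. The class, enum, and method names below are hypothetical and are not part of the Norconex API; the code only assumes that both options arrive as comma-separated strings and that the valid list is consulted first.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not Norconex code): illustrates the documented
// precedence between "validStatusCodes" and "notFoundStatusCodes".
final class StatusCodeSketch {

    enum Outcome { VALID, NOT_FOUND, OTHER }

    // Parses a comma-separated list such as "200, 404" into integers.
    static List<Integer> parseCodes(String csv) {
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(Integer::valueOf)
                .collect(Collectors.toList());
    }

    // The valid list is checked first, so a code present in both lists
    // is reported as VALID.
    static Outcome classify(
            int statusCode, List<Integer> validCodes, List<Integer> notFoundCodes) {
        if (validCodes.contains(statusCode)) {
            return Outcome.VALID;
        }
        if (notFoundCodes.contains(statusCode)) {
            return Outcome.NOT_FOUND;
        }
        return Outcome.OTHER;
    }
}
```

With valid codes "200, 404" and not-found codes "404, 410", a 404 response is classified as VALID and a 410 response as NOT_FOUND.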
Accepted HTTP methods
By default this fetcher accepts HTTP GET and HEAD requests. You can limit it to only process one method with GenericHttpFetcherConfig.setHttpMethods(List).
Content type and character encoding
The default way for the HTTP Collector to identify the content type and character encoding of a document is to rely on the "Content-Type" HTTP response header. Web servers can sometimes return invalid or missing content type and character encoding information. You can optionally decide not to trust web servers' HTTP responses and have the collector perform its own content type and encoding detection. Such detection can be enabled with GenericHttpFetcherConfig.setForceContentTypeDetection(boolean) and GenericHttpFetcherConfig.setForceCharsetDetection(boolean).
XML configuration entries expecting millisecond durations can be provided in human-readable format (English only), as per DurationParser (e.g., "5 minutes and 30 seconds" or "5m30s").
HSTS Support
Upon first encountering a secure site, this fetcher checks whether the site's root domain advertises a "Strict-Transport-Security" (HSTS) policy in its HTTP response headers. That information gets cached for future requests. If the site supports HSTS, any non-secure URLs encountered on the same domain will be automatically converted to "https" (including sub-domains if HSTS indicates as such).
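As a rough illustration of the behavior described above, the sketch below parses a "Strict-Transport-Security" header value and upgrades matching "http" URLs. The class and method names are hypothetical and this is not the fetcher's actual implementation. It deliberately ignores caching, policy expiry, and explicit ports.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper (not Norconex code): a simplified view of HSTS handling.
final class HstsSketch {

    // Parsed form of a Strict-Transport-Security header value.
    static final class Policy {
        final long maxAgeSeconds;
        final boolean includeSubDomains;

        Policy(long maxAgeSeconds, boolean includeSubDomains) {
            this.maxAgeSeconds = maxAgeSeconds;
            this.includeSubDomains = includeSubDomains;
        }
    }

    // Returns null when the header is absent (no HSTS support advertised).
    // A malformed max-age value raises NumberFormatException in this sketch.
    static Policy parse(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        long maxAge = 0;
        boolean includeSubDomains = false;
        for (String part : headerValue.split(";")) {
            String directive = part.trim().toLowerCase(Locale.ROOT);
            if (directive.equals("includesubdomains")) {
                includeSubDomains = true;
            } else if (directive.startsWith("max-age=")) {
                String value = directive.substring("max-age=".length())
                        .replace("\"", "").trim();
                maxAge = Long.parseLong(value);
            }
        }
        return new Policy(maxAge, includeSubDomains);
    }

    // Converts an "http" URL to "https" when its host is the HSTS domain,
    // or one of its sub-domains if the policy includes sub-domains.
    static String upgrade(String url, String hstsDomain, Policy policy) {
        URI uri = URI.create(url);
        if (policy == null || policy.maxAgeSeconds <= 0 || uri.getHost() == null
                || !"http".equalsIgnoreCase(uri.getScheme())) {
            return url;
        }
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        String domain = hstsDomain.toLowerCase(Locale.ROOT);
        boolean sameHost = host.equals(domain);
        boolean subDomain = policy.includeSubDomains && host.endsWith("." + domain);
        if (!sameHost && !subDomain) {
            return url;
        }
        return "https" + url.substring(uri.getScheme().length());
    }
}
```

With a policy of "max-age=31536000; includeSubDomains" recorded for example.com, the URL http://shop.example.com/cart becomes https://shop.example.com/cart, while URLs on unrelated hosts are returned unchanged.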
If you want to convert non-secure URLs to secure ones regardless of website HSTS support, use GenericURLNormalizer.Normalization.secureScheme instead. To disable HSTS support, use GenericHttpFetcherConfig.setDisableHSTS(boolean).
Pro-active change detection
This fetcher takes advantage of the ETag and If-Modified-Since HTTP specifications.
On subsequent crawls, HTTP requests will include previously cached ETag and If-Modified-Since values to tell supporting servers we only want to download a document if it was modified since our last request. To disable support for pro-active change detection, you can use GenericHttpFetcherConfig.setDisableIfModifiedSince(boolean) and GenericHttpFetcherConfig.setDisableETag(boolean).
These settings have no effect for web servers not supporting them.
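The conditional-request mechanism described above can be sketched as follows. The class and method names are hypothetical, and the code assumes the values cached from the previous crawl are available as plain strings. It only shows which request headers would be added and how a "304 Not Modified" reply would be recognized, not how the fetcher stores or sends them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not Norconex code): builds conditional request headers
// from values cached during a previous crawl.
final class ConditionalRequestSketch {

    // Adds "If-None-Match" and/or "If-Modified-Since" when a cached value
    // exists and the corresponding feature has not been disabled.
    static Map<String, String> conditionalHeaders(
            String cachedETag,
            String cachedLastModified,
            boolean disableETag,
            boolean disableIfModifiedSince) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (!disableETag && cachedETag != null && !cachedETag.isEmpty()) {
            headers.put("If-None-Match", cachedETag);
        }
        if (!disableIfModifiedSince
                && cachedLastModified != null && !cachedLastModified.isEmpty()) {
            headers.put("If-Modified-Since", cachedLastModified);
        }
        return headers;
    }

    // A supporting server answers 304 when the document has not changed.
    static boolean isNotModified(int statusCode) {
        return statusCode == 304;
    }
}
```

A server that does not support these headers simply ignores them and returns the full document, which is why the settings have no effect in that case.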
XML configuration usage:
<fetcher class="com.norconex.collector.http.fetch.impl.GenericHttpFetcher">
  <userAgent>(identify yourself!)</userAgent>
  <cookieSpec>
    [STANDARD|DEFAULT|IGNORE_COOKIES|NETSCAPE|STANDARD_STRICT]
  </cookieSpec>
  <connectionTimeout>(milliseconds)</connectionTimeout>
  <socketTimeout>(milliseconds)</socketTimeout>
  <connectionRequestTimeout>(milliseconds)</connectionRequestTimeout>
  <connectionCharset>...</connectionCharset>
  <expectContinueEnabled>[false|true]</expectContinueEnabled>
  <maxRedirects>...</maxRedirects>
  <redirectURLProvider>(implementation handling redirects)</redirectURLProvider>
  <localAddress>...</localAddress>
  <maxConnections>...</maxConnections>
  <maxConnectionsPerRoute>...</maxConnectionsPerRoute>
  <maxConnectionIdleTime>(milliseconds)</maxConnectionIdleTime>
  <maxConnectionInactiveTime>(milliseconds)</maxConnectionInactiveTime>
  <!-- Be warned: trusting all certificates is usually a bad idea. -->
  <trustAllSSLCertificates>[false|true]</trustAllSSLCertificates>
  <!-- You can specify SSL/TLS protocols to use -->
  <sslProtocols>(comma-separated list)</sslProtocols>
  <!-- Disable Server Name Indication (SNI) -->
  <disableSNI>[false|true]</disableSNI>
  <!-- Disable support for website "Strict-Transport-Security" setting. -->
  <disableHSTS>[false|true]</disableHSTS>
  <!-- You can use a specific key store for SSL Certificates -->
  <keyStoreFile/>
  <proxySettings/>
  <!-- HTTP request header constants passed on every HTTP request -->
  <headers>
    <header name="(header name)">
      (header value)
    </header>
    <!-- You can repeat this header tag as needed. -->
  </headers>
  <!-- Disable conditionally getting a document based on last crawl date. -->
  <disableIfModifiedSince>[false|true]</disableIfModifiedSince>
  <!-- Disable ETag support. -->
  <disableETag>[false|true]</disableETag>
  <!-- Optional authentication details. -->
  <authentication>
    <method>[form|basic|digest|ntlm|spnego|kerberos]</method>
    <!-- These apply to any authentication mechanism -->
    <credentials/>
    <!-- These apply to FORM authentication -->
    <formUsernameField>...</formUsernameField>
    <formPasswordField>...</formPasswordField>
    <url>
      (Either a login form's action target URL or the URL of a page
      containing a login form if a "formSelector" is specified.)
    </url>
    <formCharset>...</formCharset>
    <!-- Extra form parameters required to authenticate (since 2.8.0) -->
    <formParams>
      <param name="(param name)">
        (param value)
      </param>
      <!-- You can repeat this param tag as needed. -->
    </formParams>
    <formSelector>
      (CSS selector identifying the login page. E.g., "form")
    </formSelector>
    <!-- These apply to both BASIC and DIGEST authentication -->
    <host/>
    <realm>...</realm>
    <!-- This applies to BASIC authentication -->
    <preemptive>[false|true]</preemptive>
    <!-- These apply to NTLM authentication -->
    <host/>
    <workstation>...</workstation>
    <domain>...</domain>
  </authentication>
  <validStatusCodes>(defaults to 200)</validStatusCodes>
  <notFoundStatusCodes>(defaults to 404)</notFoundStatusCodes>
  <headersPrefix>(string to prefix headers)</headersPrefix>
  <!-- Force detect, or only when not provided in HTTP response headers -->
  <forceContentTypeDetection>[false|true]</forceContentTypeDetection>
  <forceCharsetDetection>[false|true]</forceCharsetDetection>
  <referenceFilters>
    <!-- multiple "filter" tags allowed -->
    <filter class="(any reference filter class)">
      (Restrict usage of this fetcher to matching reference filters.
      Refer to the documentation for the IReferenceFilter implementation
      you are using here for usage details.)
    </filter>
  </referenceFilters>
  <!-- Comma-separated list of supported HTTP methods. -->
  <httpMethods>(defaults to: GET, HEAD)</httpMethods>
</fetcher>
XML usage example:
<fetcher class="GenericHttpFetcher">
  <authentication>
    <method>form</method>
    <credentials>
      <username>joeUser</username>
      <password>joePasword</password>
    </credentials>
    <formUsernameField>loginUser</formUsernameField>
    <formPasswordField>loginPwd</formPasswordField>
    <url>http://www.example.com/login/submit</url>
  </authentication>
</fetcher>
The above example will authenticate the crawler to a web site before crawling. The website uses an HTML form with username and password fields called "loginUser" and "loginPwd".
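To make the form authentication example more concrete, the sketch below builds the URL-encoded body that an HTML login form with those two fields would typically submit. This is an assumption about what a form-based login generally sends, not a description of the fetcher's internals. The class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not Norconex code): encodes login form fields the way
// an "application/x-www-form-urlencoded" POST body is usually built.
final class FormLoginSketch {

    // Joins URL-encoded name=value pairs with "&", preserving map order.
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Builds the body for a login form with one username and one password field.
    static String loginBody(
            String usernameField, String username,
            String passwordField, String password) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(usernameField, username);
        fields.put(passwordField, password);
        return encodeForm(fields);
    }
}
```

For the configuration above, the fields "loginUser" and "loginPwd" would carry the configured credentials in a request sent to http://www.example.com/login/submit.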
- Since:
- 3.0.0 (Merged from GenericDocumentFetcher and GenericHttpClientFactory)
- Author:
- Pascal Essiembre
-
Field Summary
Fields
Modifier and Type | Field | Description
static String | AUTH_METHOD_BASIC | BASIC authentication method.
static String | AUTH_METHOD_DIGEST | DIGEST authentication method.
static String | AUTH_METHOD_FORM | Form-based authentication method.
static String | AUTH_METHOD_KERBEROS | Experimental: Kerberos authentication method.
static String | AUTH_METHOD_NTLM | NTLM authentication method.
static String | AUTH_METHOD_SPNEGO | Experimental: SPNEGO authentication method.
-
Constructor Summary
Constructors
Constructor | Description
GenericHttpFetcher()
GenericHttpFetcher(GenericHttpFetcherConfig httpFetcherConfig)
-
Method Summary
Modifier and Type | Method | Description
protected boolean | accept(HttpMethod httpMethod) | Whether the supplied HttpMethod is supported by this fetcher.
protected void | authenticateUsingForm(org.apache.http.client.HttpClient httpClient)
protected void | buildCustomHttpClient(org.apache.http.impl.client.HttpClientBuilder builder) | For implementors to subclass.
protected org.apache.http.config.ConnectionConfig | createConnectionConfig()
protected org.apache.http.client.CredentialsProvider | createCredentialsProvider()
protected org.apache.http.client.CookieStore | createDefaultCookieStore() | Creates the default cookie store to be added to each request context.
protected List<org.apache.http.Header> | createDefaultRequestHeaders() | Creates a list of HTTP headers based on configuration.
protected org.apache.http.client.HttpClient | createHttpClient()
protected org.apache.http.HttpHost | createProxy()
protected org.apache.http.client.config.RequestConfig | createRequestConfig()
protected org.apache.http.conn.SchemePortResolver | createSchemePortResolver()
protected SSLContext | createSSLContext()
protected org.apache.http.conn.socket.LayeredConnectionSocketFactory | createSSLSocketFactory(SSLContext sslContext)
boolean | equals(Object other)
IHttpFetchResponse | fetch(CrawlDoc doc, HttpMethod httpMethod) | Performs an HTTP request for the supplied document reference and HTTP method.
protected void | fetcherShutdown(HttpCollector c) | Invoked once per fetcher when the collector ends.
protected void | fetcherStartup(HttpCollector c) | Invoked once per fetcher instance, when the collector starts.
GenericHttpFetcherConfig | getConfig()
org.apache.http.client.HttpClient | getHttpClient()
String | getUserAgent()
int | hashCode()
void | loadHttpFetcherFromXML(XML xml)
void | saveHttpFetcherToXML(XML xml)
String | toString()
Methods inherited from class com.norconex.collector.http.fetch.AbstractHttpFetcher
accept, accept, fetcherThreadBegin, fetcherThreadEnd, getReferenceFilters, loadFromXML, saveToXML, setReferenceFilters, setReferenceFilters
-
Field Detail
-
AUTH_METHOD_FORM
public static final String AUTH_METHOD_FORM
Form-based authentication method.
- See Also:
- Constant Field Values
-
AUTH_METHOD_BASIC
public static final String AUTH_METHOD_BASIC
BASIC authentication method.
- See Also:
- Constant Field Values
-
AUTH_METHOD_DIGEST
public static final String AUTH_METHOD_DIGEST
DIGEST authentication method.
- See Also:
- Constant Field Values
-
AUTH_METHOD_NTLM
public static final String AUTH_METHOD_NTLM
NTLM authentication method.
- See Also:
- Constant Field Values
-
AUTH_METHOD_SPNEGO
public static final String AUTH_METHOD_SPNEGO
Experimental: SPNEGO authentication method.
- See Also:
- Constant Field Values
-
AUTH_METHOD_KERBEROS
public static final String AUTH_METHOD_KERBEROS
Experimental: Kerberos authentication method.
- See Also:
- Constant Field Values
-
Constructor Detail
-
GenericHttpFetcher
public GenericHttpFetcher()
-
GenericHttpFetcher
public GenericHttpFetcher(GenericHttpFetcherConfig httpFetcherConfig)
-
Method Detail
-
getConfig
public GenericHttpFetcherConfig getConfig()
-
getHttpClient
public org.apache.http.client.HttpClient getHttpClient()
-
accept
protected boolean accept(HttpMethod httpMethod)
Description copied from class: AbstractHttpFetcher
Whether the supplied HttpMethod is supported by this fetcher.
- Specified by:
accept in class AbstractHttpFetcher
- Parameters:
httpMethod - the HTTP method
- Returns:
true if supported
-
fetcherStartup
protected void fetcherStartup(HttpCollector c)
Description copied from class: AbstractHttpFetcher
Invoked once per fetcher instance, when the collector starts. Default implementation does nothing.
- Overrides:
fetcherStartup in class AbstractHttpFetcher
- Parameters:
c - collector
-
fetcherShutdown
protected void fetcherShutdown(HttpCollector c)
Description copied from class: AbstractHttpFetcher
Invoked once per fetcher when the collector ends. Default implementation does nothing.
- Overrides:
fetcherShutdown in class AbstractHttpFetcher
- Parameters:
c - collector
-
getUserAgent
public String getUserAgent()
-
fetch
public IHttpFetchResponse fetch(CrawlDoc doc, HttpMethod httpMethod) throws HttpFetchException
Description copied from interface: IHttpFetcher
Performs an HTTP request for the supplied document reference and HTTP method.
For each HTTP method supported, implementors should do their best to populate the document and its CrawlDocInfo with as much information as they can.
Unsupported HTTP methods should return an HTTP response with the CrawlState.UNSUPPORTED state. To prevent users from having to configure multiple HTTP clients, implementors should try to support both the GET and HEAD methods. POST is only used in special cases and is often not used during a crawl session.
A null method is treated as a GET.
- Parameters:
doc - document to fetch or to use to make the request.
httpMethod - HTTP method
- Returns:
- an HTTP response
- Throws:
HttpFetchException - problem when fetching the document
- See Also:
HttpFetchResponseBuilder.unsupported()
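The contract described for fetch (a null method treated as GET, unsupported methods reported as such) can be sketched with stand-in types. The enums and the class below are hypothetical simplifications and do not reproduce HttpMethod, CrawlState, or IHttpFetchResponse. The actual HTTP request is omitted.

```java
import java.util.Set;

// Hypothetical helper (not Norconex code): mirrors the documented handling of
// null and unsupported HTTP methods.
final class MethodSupportSketch {

    enum Method { GET, HEAD, POST }

    enum State { FETCHED, UNSUPPORTED }

    // A null method is treated as GET.
    static Method effectiveMethod(Method requested) {
        return requested == null ? Method.GET : requested;
    }

    // Reports UNSUPPORTED when the effective method is not in the accepted
    // set; otherwise a real fetcher would perform the request at this point.
    static State fetch(Method requested, Set<Method> accepted) {
        Method method = effectiveMethod(requested);
        if (!accepted.contains(method)) {
            return State.UNSUPPORTED;
        }
        return State.FETCHED;
    }
}
```

With the default accepted methods (GET and HEAD), a null or GET request proceeds while a POST request yields the unsupported state.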
-
createHttpClient
protected org.apache.http.client.HttpClient createHttpClient()
-
buildCustomHttpClient
protected void buildCustomHttpClient(org.apache.http.impl.client.HttpClientBuilder builder)
For implementors to subclass. Does nothing by default.
- Parameters:
builder - http client builder
-
authenticateUsingForm
protected void authenticateUsingForm(org.apache.http.client.HttpClient httpClient)
-
createDefaultCookieStore
protected org.apache.http.client.CookieStore createDefaultCookieStore()
Creates the default cookie store to be added to each request context.
- Returns:
- a cookie store
-
createDefaultRequestHeaders
protected List<org.apache.http.Header> createDefaultRequestHeaders()
Creates a list of HTTP headers based on configuration.
This method will also add a "Basic" authentication header if "preemptive" is true on the authentication configuration and credentials were supplied.
- Returns:
- a list of HTTP request headers
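For reference, a preemptive "Basic" header carries the Base64 encoding of "username:password". The sketch below shows how such a header value is commonly built using only the JDK. The class and method names are hypothetical, and the choice of UTF-8 for the credential bytes is an assumption made for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper (not Norconex code): builds the value of an
// "Authorization" header for HTTP Basic authentication.
final class PreemptiveBasicSketch {

    // Returns "Basic " followed by Base64("username:password").
    static String basicAuthorizationValue(String username, String password) {
        String pair = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }
}
```

Sending this header on the first request avoids the extra round trip of waiting for a 401 challenge, at the cost of transmitting credentials to a server that may not have asked for them.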
-
createSchemePortResolver
protected org.apache.http.conn.SchemePortResolver createSchemePortResolver()
-
createRequestConfig
protected org.apache.http.client.config.RequestConfig createRequestConfig()
-
createProxy
protected org.apache.http.HttpHost createProxy()
-
createCredentialsProvider
protected org.apache.http.client.CredentialsProvider createCredentialsProvider()
-
createConnectionConfig
protected org.apache.http.config.ConnectionConfig createConnectionConfig()
-
createSSLSocketFactory
protected org.apache.http.conn.socket.LayeredConnectionSocketFactory createSSLSocketFactory(SSLContext sslContext)
-
createSSLContext
protected SSLContext createSSLContext()
-
loadHttpFetcherFromXML
public void loadHttpFetcherFromXML(XML xml)
- Specified by:
loadHttpFetcherFromXML in class AbstractHttpFetcher
-
saveHttpFetcherToXML
public void saveHttpFetcherToXML(XML xml)
- Specified by:
saveHttpFetcherToXML in class AbstractHttpFetcher
-
equals
public boolean equals(Object other)
- Overrides:
equals in class AbstractHttpFetcher
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractHttpFetcher
-
toString
public String toString()
- Overrides:
toString in class AbstractHttpFetcher