Class GenericHttpFetcher
- java.lang.Object
  - com.norconex.collector.http.fetch.AbstractHttpFetcher
    - com.norconex.collector.http.fetch.impl.GenericHttpFetcher
- All Implemented Interfaces: IHttpFetcher, IEventListener<Event>, IXMLConfigurable, EventListener, Consumer<Event>
public class GenericHttpFetcher extends AbstractHttpFetcher
Default implementation of IHttpFetcher, based on Apache HttpClient.
The "validStatusCodes" and "notFoundStatusCodes" configuration options expect a comma-separated list of HTTP response codes. If a code is added to both, the valid list takes precedence.
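The precedence rule can be illustrated with a minimal sketch. The class `StatusCodeSketch`, its methods, and the `Outcome` enum are hypothetical names used only for illustration; they are not part of the Norconex API, and the actual fetcher may classify other codes differently.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of the Norconex API. It only illustrates the
// documented rule: a code present in both lists is treated as valid.
final class StatusCodeSketch {

    enum Outcome { VALID, NOT_FOUND, OTHER }

    // Parses a comma-separated list of HTTP status codes, e.g. "200, 404".
    static Set<Integer> parseCodes(String csv) {
        Set<Integer> codes = new LinkedHashSet<>();
        for (String token : csv.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                codes.add(Integer.parseInt(trimmed));
            }
        }
        return codes;
    }

    // The valid list is checked first, so it wins when a code is in both lists.
    static Outcome classify(
            int statusCode, Set<Integer> validCodes, Set<Integer> notFoundCodes) {
        if (validCodes.contains(statusCode)) {
            return Outcome.VALID;
        }
        if (notFoundCodes.contains(statusCode)) {
            return Outcome.NOT_FOUND;
        }
        return Outcome.OTHER;
    }

    public static void main(String[] args) {
        Set<Integer> valid = parseCodes("200, 404");
        Set<Integer> notFound = parseCodes("404, 410");
        System.out.println(classify(404, valid, notFound)); // VALID
        System.out.println(classify(410, valid, notFound)); // NOT_FOUND
        System.out.println(classify(500, valid, notFound)); // OTHER
    }
}
```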
Accepted HTTP methods
By default this fetcher accepts HTTP GET and HEAD requests. You can limit it to only process one method with GenericHttpFetcherConfig.setHttpMethods(List).
Content type and character encoding
The default way for the HTTP Collector to identify the content type and character encoding of a document is to rely on the "Content-Type" HTTP response header. Web servers can sometimes return invalid or missing content type and character encoding information. You can optionally decide not to trust web servers' HTTP responses and have the collector perform its own content type and encoding detection. Such detection can be enabled with GenericHttpFetcherConfig.setForceContentTypeDetection(boolean) and GenericHttpFetcherConfig.setForceCharsetDetection(boolean).
XML configuration entries expecting millisecond durations can be provided in human-readable format (English only), as per DurationParser (e.g., "5 minutes and 30 seconds" or "5m30s").
HSTS Support
Upon first encountering a secure site, this fetcher checks whether the site's root domain declares "Strict-Transport-Security" (HSTS) policy support in its HTTP response headers. That information is cached for future requests. If the site supports HSTS, any non-secure URLs encountered on the same domain are automatically converted to "https" (including sub-domains if the HSTS policy indicates as such).
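To make the header semantics concrete, here is a minimal sketch of reading a "Strict-Transport-Security" value and upgrading a URL. `HstsHeaderSketch` and its members are hypothetical names, not the fetcher's internal implementation; the sketch assumes the standard `max-age` and `includeSubDomains` directives and ignores explicit port numbers.

```java
import java.util.Locale;

// Hypothetical illustration only; not how GenericHttpFetcher is implemented.
final class HstsHeaderSketch {

    final long maxAgeSeconds;        // -1 when the directive is missing or invalid
    final boolean includeSubDomains;

    private HstsHeaderSketch(long maxAgeSeconds, boolean includeSubDomains) {
        this.maxAgeSeconds = maxAgeSeconds;
        this.includeSubDomains = includeSubDomains;
    }

    // Parses a header value such as "max-age=31536000; includeSubDomains".
    static HstsHeaderSketch parse(String headerValue) {
        long maxAge = -1;
        boolean subDomains = false;
        for (String part : headerValue.split(";")) {
            String directive = part.trim();
            String lower = directive.toLowerCase(Locale.ROOT);
            int eq = directive.indexOf('=');
            if (lower.startsWith("max-age") && eq >= 0) {
                // The value may be quoted, so strip double quotes before parsing.
                String value = directive.substring(eq + 1).trim().replace("\"", "");
                try {
                    maxAge = Long.parseLong(value);
                } catch (NumberFormatException e) {
                    maxAge = -1;
                }
            } else if (lower.equals("includesubdomains")) {
                subDomains = true;
            }
        }
        return new HstsHeaderSketch(maxAge, subDomains);
    }

    // A max-age of zero tells clients to forget the policy.
    boolean isEnforced() {
        return maxAgeSeconds > 0;
    }

    // Rewrites an "http://" URL to "https://"; other URLs are returned as-is.
    static String toSecure(String url) {
        if (url.regionMatches(true, 0, "http://", 0, 7)) {
            return "https://" + url.substring(7);
        }
        return url;
    }

    public static void main(String[] args) {
        HstsHeaderSketch policy = parse("max-age=31536000; includeSubDomains");
        if (policy.isEnforced()) {
            System.out.println(toSecure("http://example.com/page"));
        }
    }
}
```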
If you want to convert non-secure URLs to secure ones regardless of website HSTS support, use GenericURLNormalizer.Normalization.secureScheme instead. To disable HSTS support, use GenericHttpFetcherConfig.setDisableHSTS(boolean).
Pro-active change detection
This fetcher takes advantage of the ETag and If-Modified-Since HTTP specifications.
On subsequent crawls, HTTP requests will include previously cached ETag and If-Modified-Since values to tell supporting servers we only want to download a document if it was modified since our last request. To disable support for pro-active change detection, you can use GenericHttpFetcherConfig.setDisableIfModifiedSince(boolean) and GenericHttpFetcherConfig.setDisableETag(boolean).
These settings have no effect for web servers not supporting them.
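As a rough illustration of conditional requests in general HTTP terms: a previously received ETag is sent back in the "If-None-Match" request header, a previous "Last-Modified" value is sent back in "If-Modified-Since", and a supporting server answers 304 (Not Modified) when nothing changed. The class and method names below are hypothetical and do not reflect the fetcher's internal code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of standard HTTP conditional request headers.
final class ConditionalRequestSketch {

    // Builds the conditional headers from values cached on a previous crawl.
    // Null or empty cached values simply produce no header.
    static Map<String, String> conditionalHeaders(
            String cachedETag, String cachedLastModified) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (cachedETag != null && !cachedETag.isEmpty()) {
            headers.put("If-None-Match", cachedETag);
        }
        if (cachedLastModified != null && !cachedLastModified.isEmpty()) {
            headers.put("If-Modified-Since", cachedLastModified);
        }
        return headers;
    }

    // 304 means the server considers the cached copy still current.
    static boolean isUnmodified(int statusCode) {
        return statusCode == 304;
    }

    public static void main(String[] args) {
        Map<String, String> headers = conditionalHeaders(
                "\"abc123\"", "Wed, 21 Oct 2015 07:28:00 GMT");
        headers.forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```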
XML configuration usage:
<fetcher class="com.norconex.collector.http.fetch.impl.GenericHttpFetcher">

  <userAgent>(identify yourself!)</userAgent>
  <cookieSpec>
    [STANDARD|DEFAULT|IGNORE_COOKIES|NETSCAPE|STANDARD_STRICT]
  </cookieSpec>
  <connectionTimeout>(milliseconds)</connectionTimeout>
  <socketTimeout>(milliseconds)</socketTimeout>
  <connectionRequestTimeout>(milliseconds)</connectionRequestTimeout>
  <connectionCharset>...</connectionCharset>
  <expectContinueEnabled>[false|true]</expectContinueEnabled>
  <maxRedirects>...</maxRedirects>
  <redirectURLProvider>(implementation handling redirects)</redirectURLProvider>
  <localAddress>...</localAddress>
  <maxConnections>...</maxConnections>
  <maxConnectionsPerRoute>...</maxConnectionsPerRoute>
  <maxConnectionIdleTime>(milliseconds)</maxConnectionIdleTime>
  <maxConnectionInactiveTime>(milliseconds)</maxConnectionInactiveTime>

  <!-- Be warned: trusting all certificates is usually a bad idea. -->
  <trustAllSSLCertificates>[false|true]</trustAllSSLCertificates>

  <!-- You can specify SSL/TLS protocols to use -->
  <sslProtocols>(comma-separated list)</sslProtocols>

  <!-- Disable Server Name Indication (SNI) -->
  <disableSNI>[false|true]</disableSNI>

  <!-- Disable support for website "Strict-Transport-Security" setting. -->
  <disableHSTS>[false|true]</disableHSTS>

  <!-- You can use a specific key store for SSL Certificates -->
  <keyStoreFile/>

  <proxySettings/>

  <!-- HTTP request header constants passed on every HTTP request -->
  <headers>
    <header name="(header name)">
      (header value)
    </header>
    <!-- You can repeat this header tag as needed. -->
  </headers>

  <!-- Disable conditionally getting a document based on last crawl date. -->
  <disableIfModifiedSince>[false|true]</disableIfModifiedSince>

  <!-- Disable ETag support. -->
  <disableETag>[false|true]</disableETag>

  <!-- Optional authentication details. -->
  <authentication>
    <method>[form|basic|digest|ntlm|spnego|kerberos]</method>

    <!-- These apply to any authentication mechanism -->
    <credentials/>

    <!-- These apply to FORM authentication -->
    <formUsernameField>...</formUsernameField>
    <formPasswordField>...</formPasswordField>
    <url>
      (Either a login form's action target URL or the URL of a page
      containing a login form if a "formSelector" is specified.)
    </url>
    <formCharset>...</formCharset>
    <!-- Extra form parameters required to authenticate (since 2.8.0) -->
    <formParams>
      <param name="(param name)">
        (param value)
      </param>
      <!-- You can repeat this param tag as needed. -->
    </formParams>
    <formSelector>
      (CSS selector identifying the login page. E.g., "form")
    </formSelector>

    <!-- These apply to both BASIC and DIGEST authentication -->
    <host/>
    <realm>...</realm>

    <!-- This applies to BASIC authentication -->
    <preemptive>[false|true]</preemptive>

    <!-- These apply to NTLM authentication -->
    <host/>
    <workstation>...</workstation>
    <domain>...</domain>
  </authentication>

  <validStatusCodes>(defaults to 200)</validStatusCodes>
  <notFoundStatusCodes>(defaults to 404)</notFoundStatusCodes>
  <headersPrefix>(string to prefix headers)</headersPrefix>

  <!-- Force detect, or only when not provided in HTTP response headers -->
  <forceContentTypeDetection>[false|true]</forceContentTypeDetection>
  <forceCharsetDetection>[false|true]</forceCharsetDetection>

  <referenceFilters>
    <!-- multiple "filter" tags allowed -->
    <filter class="(any reference filter class)">
      (Restrict usage of this fetcher to matching reference filters.
      Refer to the documentation for the IReferenceFilter implementation
      you are using here for usage details.)
    </filter>
  </referenceFilters>

  <!-- Comma-separated list of supported HTTP methods. -->
  <httpMethods>(defaults to: GET, HEAD)</httpMethods>

</fetcher>
XML usage example:
<fetcher class="GenericHttpFetcher">
  <authentication>
    <method>form</method>
    <credentials>
      <username>joeUser</username>
      <password>joePasword</password>
    </credentials>
    <formUsernameField>loginUser</formUsernameField>
    <formPasswordField>loginPwd</formPasswordField>
    <url>http://www.example.com/login/submit</url>
  </authentication>
</fetcher>
The above example will authenticate the crawler to a website before crawling. The website uses an HTML form with username and password fields called "loginUser" and "loginPwd".
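For intuition, a form login of this kind typically amounts to posting the configured field names and credentials as a URL-encoded body. The sketch below assumes the usual "application/x-www-form-urlencoded" encoding used by HTML forms; `FormLoginBodySketch` is a hypothetical name, the password value is made up to show escaping, and the code is not the fetcher's own authentication logic.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration of a URL-encoded login form body.
final class FormLoginBodySketch {

    // Encodes each field name and value, then joins the pairs with '&'.
    static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("loginUser", "joeUser");
        fields.put("loginPwd", "s3cret&safe"); // made-up value with a reserved character
        // Prints: loginUser=joeUser&loginPwd=s3cret%26safe
        System.out.println(encode(fields));
    }
}
```

Encoding matters here because a raw "&" in the password would otherwise be read as a field separator.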
- Since: 3.0.0 (Merged from GenericDocumentFetcher and GenericHttpClientFactory)
- Author: Pascal Essiembre
-
-
Field Summary
Modifier and Type | Field | Description
static String | AUTH_METHOD_BASIC | BASIC authentication method.
static String | AUTH_METHOD_DIGEST | DIGEST authentication method.
static String | AUTH_METHOD_FORM | Form-based authentication method.
static String | AUTH_METHOD_KERBEROS | Experimental: Kerberos authentication method.
static String | AUTH_METHOD_NTLM | NTLM authentication method.
static String | AUTH_METHOD_SPNEGO | Experimental: SPNEGO authentication method.
-
Constructor Summary
Constructor | Description
GenericHttpFetcher() |
GenericHttpFetcher(GenericHttpFetcherConfig httpFetcherConfig) |
-
Method Summary
Modifier and Type | Method | Description
protected boolean | accept(HttpMethod httpMethod) | Whether the supplied HttpMethod is supported by this fetcher.
protected void | authenticateUsingForm(org.apache.http.client.HttpClient httpClient) |
protected void | buildCustomHttpClient(org.apache.http.impl.client.HttpClientBuilder builder) | For implementors to subclass.
protected org.apache.http.config.ConnectionConfig | createConnectionConfig() |
protected org.apache.http.client.CredentialsProvider | createCredentialsProvider() |
protected org.apache.http.client.CookieStore | createDefaultCookieStore() | Creates the default cookie store to be added to each request context.
protected List<org.apache.http.Header> | createDefaultRequestHeaders() | Creates a list of HTTP headers based on configuration.
protected org.apache.http.client.HttpClient | createHttpClient() |
protected org.apache.http.HttpHost | createProxy() |
protected org.apache.http.client.config.RequestConfig | createRequestConfig() |
protected org.apache.http.conn.SchemePortResolver | createSchemePortResolver() |
protected SSLContext | createSSLContext() |
protected org.apache.http.conn.socket.LayeredConnectionSocketFactory | createSSLSocketFactory(SSLContext sslContext) |
boolean | equals(Object other) |
IHttpFetchResponse | fetch(CrawlDoc doc, HttpMethod httpMethod) | Performs an HTTP request for the supplied document reference and HTTP method.
protected void | fetcherShutdown(HttpCollector c) | Invoked once per fetcher when the collector ends.
protected void | fetcherStartup(HttpCollector c) | Invoked once per fetcher instance, when the collector starts.
GenericHttpFetcherConfig | getConfig() |
org.apache.http.client.HttpClient | getHttpClient() |
String | getUserAgent() |
int | hashCode() |
void | loadHttpFetcherFromXML(XML xml) |
void | saveHttpFetcherToXML(XML xml) |
String | toString() |
-
Methods inherited from class com.norconex.collector.http.fetch.AbstractHttpFetcher
accept, accept, fetcherThreadBegin, fetcherThreadEnd, getReferenceFilters, loadFromXML, saveToXML, setReferenceFilters, setReferenceFilters
-
-
-
-
Field Detail
-
AUTH_METHOD_FORM
public static final String AUTH_METHOD_FORM
Form-based authentication method.
- See Also: Constant Field Values
-
AUTH_METHOD_BASIC
public static final String AUTH_METHOD_BASIC
BASIC authentication method.
- See Also: Constant Field Values
-
AUTH_METHOD_DIGEST
public static final String AUTH_METHOD_DIGEST
DIGEST authentication method.
- See Also: Constant Field Values
-
AUTH_METHOD_NTLM
public static final String AUTH_METHOD_NTLM
NTLM authentication method.
- See Also: Constant Field Values
-
AUTH_METHOD_SPNEGO
public static final String AUTH_METHOD_SPNEGO
Experimental: SPNEGO authentication method.
- See Also: Constant Field Values
-
AUTH_METHOD_KERBEROS
public static final String AUTH_METHOD_KERBEROS
Experimental: Kerberos authentication method.
- See Also: Constant Field Values
-
-
Constructor Detail
-
GenericHttpFetcher
public GenericHttpFetcher()
-
GenericHttpFetcher
public GenericHttpFetcher(GenericHttpFetcherConfig httpFetcherConfig)
-
-
Method Detail
-
getConfig
public GenericHttpFetcherConfig getConfig()
-
getHttpClient
public org.apache.http.client.HttpClient getHttpClient()
-
accept
protected boolean accept(HttpMethod httpMethod)
Description copied from class: AbstractHttpFetcher
Whether the supplied HttpMethod is supported by this fetcher.
- Specified by: accept in class AbstractHttpFetcher
- Parameters: httpMethod - the HTTP method
- Returns: true if supported
-
fetcherStartup
protected void fetcherStartup(HttpCollector c)
Description copied from class: AbstractHttpFetcher
Invoked once per fetcher instance, when the collector starts. Default implementation does nothing.
- Overrides: fetcherStartup in class AbstractHttpFetcher
- Parameters: c - collector
-
fetcherShutdown
protected void fetcherShutdown(HttpCollector c)
Description copied from class: AbstractHttpFetcher
Invoked once per fetcher when the collector ends. Default implementation does nothing.
- Overrides: fetcherShutdown in class AbstractHttpFetcher
- Parameters: c - collector
-
getUserAgent
public String getUserAgent()
-
fetch
public IHttpFetchResponse fetch(CrawlDoc doc, HttpMethod httpMethod) throws HttpFetchException
Description copied from interface: IHttpFetcher
Performs an HTTP request for the supplied document reference and HTTP method.
For each HTTP method supported, implementors should do their best to populate the document and its CrawlDocInfo with as much information as they can.
Unsupported HTTP methods should return an HTTP response with the CrawlState.UNSUPPORTED state. To prevent users from having to configure multiple HTTP clients, implementors should try to support both the GET and HEAD methods. POST is only used in special cases and is often not used during a crawl session.
A null method is treated as a GET.
- Parameters: doc - document to fetch or to use to make the request.
  httpMethod - HTTP method
- Returns: an HTTP response
- Throws: HttpFetchException - problem when fetching the document
- See Also: HttpFetchResponseBuilder.unsupported()
-
createHttpClient
protected org.apache.http.client.HttpClient createHttpClient()
-
buildCustomHttpClient
protected void buildCustomHttpClient(org.apache.http.impl.client.HttpClientBuilder builder)
For implementors to subclass. Does nothing by default.
- Parameters: builder - http client builder
-
authenticateUsingForm
protected void authenticateUsingForm(org.apache.http.client.HttpClient httpClient)
-
createDefaultCookieStore
protected org.apache.http.client.CookieStore createDefaultCookieStore()
Creates the default cookie store to be added to each request context.
- Returns: a cookie store
-
createDefaultRequestHeaders
protected List<org.apache.http.Header> createDefaultRequestHeaders()
Creates a list of HTTP headers based on configuration.
This method will also add a "Basic" authentication header if "preemptive" is true on the authentication configuration and credentials were supplied.
- Returns: a list of HTTP request headers
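For reference, a preemptive "Basic" header value under the standard HTTP Basic scheme is the word "Basic" followed by the Base64 encoding of "username:password". The sketch below uses a hypothetical class name and assumes UTF-8 credentials; it illustrates the header format only and is not this method's actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration of the standard "Authorization: Basic ..." value.
final class PreemptiveBasicAuthSketch {

    // Joins the credentials with ':' and Base64-encodes the result.
    static String basicAuthHeaderValue(String username, String password) {
        String pair = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Prints: Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
        System.out.println("Authorization: "
                + basicAuthHeaderValue("Aladdin", "open sesame"));
    }
}
```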
-
createSchemePortResolver
protected org.apache.http.conn.SchemePortResolver createSchemePortResolver()
-
createRequestConfig
protected org.apache.http.client.config.RequestConfig createRequestConfig()
-
createProxy
protected org.apache.http.HttpHost createProxy()
-
createCredentialsProvider
protected org.apache.http.client.CredentialsProvider createCredentialsProvider()
-
createConnectionConfig
protected org.apache.http.config.ConnectionConfig createConnectionConfig()
-
createSSLSocketFactory
protected org.apache.http.conn.socket.LayeredConnectionSocketFactory createSSLSocketFactory(SSLContext sslContext)
-
createSSLContext
protected SSLContext createSSLContext()
-
loadHttpFetcherFromXML
public void loadHttpFetcherFromXML(XML xml)
- Specified by: loadHttpFetcherFromXML in class AbstractHttpFetcher
-
saveHttpFetcherToXML
public void saveHttpFetcherToXML(XML xml)
- Specified by: saveHttpFetcherToXML in class AbstractHttpFetcher
-
equals
public boolean equals(Object other)
- Overrides: equals in class AbstractHttpFetcher
-
hashCode
public int hashCode()
- Overrides: hashCode in class AbstractHttpFetcher
-
toString
public String toString()
- Overrides: toString in class AbstractHttpFetcher
-
-