Class GenericHttpFetcher
- All Implemented Interfaces:
IHttpFetcher, IEventListener<Event>, IXMLConfigurable, EventListener, Consumer<Event>
Default implementation of IHttpFetcher, based on Apache HttpClient.
The "validStatusCodes" and "notFoundStatusCodes" configuration options expect a comma-separated list of HTTP response codes. If a code is added to both, the valid list takes precedence.
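The precedence rule above can be sketched as follows. This is a minimal illustration only, not the fetcher's actual code: the class StatusCodeClassifier, its Outcome enum and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only (not part of the collector API).
final class StatusCodeClassifier {

    /** Hypothetical outcomes of classifying an HTTP status code. */
    enum Outcome { VALID, NOT_FOUND, BAD_STATUS }

    private final Set<Integer> validCodes;
    private final Set<Integer> notFoundCodes;

    StatusCodeClassifier(String validCsv, String notFoundCsv) {
        this.validCodes = parseCodes(validCsv);
        this.notFoundCodes = parseCodes(notFoundCsv);
    }

    // Parses a comma-separated list such as "200, 204" into a set of integers.
    static Set<Integer> parseCodes(String csv) {
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(Integer::valueOf)
                .collect(Collectors.toSet());
    }

    Outcome classify(int statusCode) {
        // The valid list is checked first, so it wins when a code is in both.
        if (validCodes.contains(statusCode)) {
            return Outcome.VALID;
        }
        if (notFoundCodes.contains(statusCode)) {
            return Outcome.NOT_FOUND;
        }
        return Outcome.BAD_STATUS;
    }
}
```

With "200, 404" as the valid list and "404, 410" as the not-found list, a 404 response is classified as valid because the valid list is consulted first.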
Accepted HTTP methods
By default this fetcher accepts HTTP GET and HEAD requests. You can limit
it to only process one method with
GenericHttpFetcherConfig.setHttpMethods(List).
Content type and character encoding
The default way for the HTTP Collector to identify the content type
and character encoding of a document is to rely on the
"Content-Type"
HTTP response header. Web servers can sometimes return invalid
or missing content type and character encoding information.
You can optionally decide not to trust web servers' HTTP responses and have
the collector perform its own content type and encoding detection.
Such detection can be enabled with
GenericHttpFetcherConfig.setForceContentTypeDetection(boolean)
and GenericHttpFetcherConfig.setForceCharsetDetection(boolean).
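To illustrate why a "Content-Type" header cannot always be trusted, here is a minimal sketch that extracts the media type and character encoding from a header value and reports nothing when the value is missing or invalid, which is the situation where detection would be needed. The class ContentTypeHeader and its methods are hypothetical and are not how the collector performs detection.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, for illustration only (not part of the collector API).
final class ContentTypeHeader {

    // Returns the media type portion (e.g. "text/html") when one is present.
    static Optional<String> mediaType(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        String type = headerValue.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return type.contains("/") ? Optional.of(type) : Optional.empty();
    }

    // Returns the declared charset only when the JVM recognizes its name.
    static Optional<Charset> charset(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        for (String param : headerValue.split(";")) {
            String[] kv = param.trim().split("=", 2);
            if (kv.length == 2 && kv[0].trim().equalsIgnoreCase("charset")) {
                String name = kv[1].trim().replace("\"", "");
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: nothing trustworthy.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

An empty result from either method corresponds to the "invalid or missing" case described above.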
XML configuration entries expecting millisecond durations
can be provided in human-readable format (English only), as per
DurationParser (e.g., "5 minutes and 30 seconds" or "5m30s").
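As a rough idea of what accepting both forms involves, the sketch below converts the two example strings to milliseconds. It is an assumption-laden illustration and not the DurationParser implementation: the class SimpleDurationSketch, its unit table and its error handling are hypothetical, and the real parser may accept more units or behave differently.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, for illustration only (not the DurationParser class).
final class SimpleDurationSketch {

    // A number followed by a unit, e.g. "5m" or "30 seconds".
    private static final Pattern PART = Pattern.compile("(\\d+)\\s*([a-zA-Z]+)");

    // Milliseconds per unit, for the handful of units this sketch accepts.
    private static final Map<String, Long> UNIT_MILLIS = Map.ofEntries(
            Map.entry("ms", 1L),
            Map.entry("s", 1_000L),
            Map.entry("second", 1_000L),
            Map.entry("seconds", 1_000L),
            Map.entry("m", 60_000L),
            Map.entry("minute", 60_000L),
            Map.entry("minutes", 60_000L),
            Map.entry("h", 3_600_000L),
            Map.entry("hour", 3_600_000L),
            Map.entry("hours", 3_600_000L));

    static long toMillis(String text) {
        String trimmed = text.trim();
        if (trimmed.matches("\\d+")) {
            return Long.parseLong(trimmed); // plain number: already milliseconds
        }
        long total = 0;
        boolean matched = false;
        Matcher m = PART.matcher(trimmed);
        while (m.find()) {
            String unit = m.group(2).toLowerCase(Locale.ROOT);
            Long factor = UNIT_MILLIS.get(unit);
            if (factor == null) {
                throw new IllegalArgumentException("Unknown unit: " + unit);
            }
            total += Long.parseLong(m.group(1)) * factor;
            matched = true;
        }
        if (!matched) {
            throw new IllegalArgumentException("Not a duration: " + text);
        }
        return total;
    }
}
```

Both "5 minutes and 30 seconds" and "5m30s" yield 330000 milliseconds with this sketch; connecting words such as "and" are simply skipped.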
HSTS Support
Upon first encountering a secure site, this fetcher checks whether the site's root domain includes a "Strict-Transport-Security" (HSTS) policy in its HTTP response headers. That information is cached for future requests. If the site supports HSTS, any non-secure URLs encountered on the same domain are automatically converted to "https" (including sub-domains if the HSTS policy says so).
If you want to convert non-secure URLs to secure ones regardless of website
HSTS support, use
GenericURLNormalizer.Normalization.secureScheme instead.
To disable HSTS support, use
GenericHttpFetcherConfig.setDisableHSTS(boolean).
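The HSTS behavior described above can be pictured with the following minimal sketch. It parses a "Strict-Transport-Security" header value and upgrades matching "http" URLs. The class HstsSketch, its Policy type and its methods are hypothetical; the fetcher's own caching and domain resolution are not shown, and explicit ports in URLs are deliberately ignored here.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch, for illustration only (not the fetcher's HSTS code).
final class HstsSketch {

    /** Parsed form of a Strict-Transport-Security header value. */
    static final class Policy {
        final long maxAgeSeconds;
        final boolean includeSubDomains;

        Policy(long maxAgeSeconds, boolean includeSubDomains) {
            this.maxAgeSeconds = maxAgeSeconds;
            this.includeSubDomains = includeSubDomains;
        }
    }

    // Parses e.g. "max-age=31536000; includeSubDomains". Returns null when no
    // usable policy is present (header missing, max-age absent, zero or invalid).
    static Policy parse(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        long maxAge = -1;
        boolean subDomains = false;
        for (String directive : headerValue.split(";")) {
            String d = directive.trim().toLowerCase(Locale.ROOT);
            if (d.startsWith("max-age=")) {
                String value = d.substring("max-age=".length()).replace("\"", "").trim();
                try {
                    maxAge = Long.parseLong(value);
                } catch (NumberFormatException e) {
                    return null;
                }
            } else if (d.equals("includesubdomains")) {
                subDomains = true;
            }
        }
        return maxAge > 0 ? new Policy(maxAge, subDomains) : null;
    }

    // Rewrites an http URL to https when its host is covered by the policy
    // obtained for rootDomain. Other URLs are returned unchanged.
    static String upgrade(String url, String rootDomain, Policy policy) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (policy == null || host == null
                || !"http".equalsIgnoreCase(uri.getScheme())) {
            return url;
        }
        String h = host.toLowerCase(Locale.ROOT);
        String root = rootDomain.toLowerCase(Locale.ROOT);
        boolean covered = h.equals(root)
                || (policy.includeSubDomains && h.endsWith("." + root));
        return covered ? "https" + url.substring(uri.getScheme().length()) : url;
    }
}
```

In this sketch a sub-domain such as www.example.com is only upgraded when the policy for example.com carries the includeSubDomains directive.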
Pro-active change detection
This fetcher takes advantage of the ETag and If-Modified-Since HTTP specifications.
On subsequent crawls, HTTP requests will include previously cached
ETag and If-Modified-Since values to tell
supporting servers we only want to download a document if it was modified
since our last request.
To disable support for pro-active change detection, you can use
GenericHttpFetcherConfig.setDisableIfModifiedSince(boolean) and
GenericHttpFetcherConfig.setDisableETag(boolean).
These settings have no effect for web servers not supporting them.
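For readers unfamiliar with conditional requests, the sketch below shows the request headers involved. Per the HTTP specification, a cached ETag is sent back in an "If-None-Match" header and a date in "If-Modified-Since"; a supporting server then answers 304 (Not Modified) when nothing changed. The class ConditionalRequestSketch and its parameters are hypothetical, and which date the fetcher actually sends is an assumption here.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, for illustration only (not the fetcher's own code).
final class ConditionalRequestSketch {

    // HTTP-date in its fixed-length form, always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    // Builds the conditional headers from values cached on a previous crawl.
    // Either argument may be null, in which case its header is left out.
    static Map<String, String> conditionalHeaders(
            String cachedETag, Instant previousDate) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (cachedETag != null && !cachedETag.isEmpty()) {
            headers.put("If-None-Match", cachedETag);
        }
        if (previousDate != null) {
            headers.put("If-Modified-Since", HTTP_DATE.format(previousDate));
        }
        return headers;
    }

    // A 304 (Not Modified) response means the cached copy is still current.
    static boolean isNotModified(int statusCode) {
        return statusCode == 304;
    }
}
```

Servers that ignore these headers simply return the full document, which matches the note above that the settings have no effect for web servers not supporting them.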
XML configuration usage:
<fetcher
    class="com.norconex.collector.http.fetch.impl.GenericHttpFetcher">
  <userAgent>(identify yourself!)</userAgent>
  <cookieSpec>
    [STANDARD|DEFAULT|IGNORE_COOKIES|NETSCAPE|STANDARD_STRICT]
  </cookieSpec>
  <connectionTimeout>(milliseconds)</connectionTimeout>
  <socketTimeout>(milliseconds)</socketTimeout>
  <connectionRequestTimeout>(milliseconds)</connectionRequestTimeout>
  <connectionCharset>...</connectionCharset>
  <expectContinueEnabled>[false|true]</expectContinueEnabled>
  <maxRedirects>...</maxRedirects>
  <redirectURLProvider>(implementation handling redirects)</redirectURLProvider>
  <localAddress>...</localAddress>
  <maxConnections>...</maxConnections>
  <maxConnectionsPerRoute>...</maxConnectionsPerRoute>
  <maxConnectionIdleTime>(milliseconds)</maxConnectionIdleTime>
  <maxConnectionInactiveTime>(milliseconds)</maxConnectionInactiveTime>
  <!-- Be warned: trusting all certificates is usually a bad idea. -->
  <trustAllSSLCertificates>[false|true]</trustAllSSLCertificates>
  <!-- You can specify SSL/TLS protocols to use -->
  <sslProtocols>(comma-separated list)</sslProtocols>
  <!-- Disable Server Name Indication (SNI) -->
  <disableSNI>[false|true]</disableSNI>
  <!-- Disable support for website "Strict-Transport-Security" setting. -->
  <disableHSTS>[false|true]</disableHSTS>
  <!-- You can use a specific key store for SSL Certificates -->
  <keyStoreFile/>
  <proxySettings/>
  <!-- HTTP request header constants passed on every HTTP request -->
  <headers>
    <header name="(header name)">
      (header value)
    </header>
    <!-- You can repeat this header tag as needed. -->
  </headers>
  <!-- Disable conditionally getting a document based on last crawl date. -->
  <disableIfModifiedSince>[false|true]</disableIfModifiedSince>
  <!-- Disable ETag support. -->
  <disableETag>[false|true]</disableETag>
  <!-- Optional authentication details. -->
  <authentication>
    <method>[form|basic|digest|ntlm|spnego|kerberos]</method>
    <!-- These apply to any authentication mechanism -->
    <credentials/>
    <!-- These apply to FORM authentication -->
    <formUsernameField>...</formUsernameField>
    <formPasswordField>...</formPasswordField>
    <url>
      (Either a login form's action target URL or the URL of a page containing
      a login form if a "formSelector" is specified.)
    </url>
    <formCharset>...</formCharset>
    <!-- Extra form parameters required to authenticate (since 2.8.0) -->
    <formParams>
      <param name="(param name)">
        (param value)
      </param>
      <!-- You can repeat this param tag as needed. -->
    </formParams>
    <formSelector>
      (CSS selector identifying the login page. E.g., "form")
    </formSelector>
    <!-- These apply to both BASIC and DIGEST authentication -->
    <host/>
    <realm>...</realm>
    <!-- This applies to BASIC authentication -->
    <preemptive>[false|true]</preemptive>
    <!-- These apply to NTLM authentication -->
    <host/>
    <workstation>...</workstation>
    <domain>...</domain>
  </authentication>
  <validStatusCodes>(defaults to 200)</validStatusCodes>
  <notFoundStatusCodes>(defaults to 404)</notFoundStatusCodes>
  <headersPrefix>(string to prefix headers)</headersPrefix>
  <!-- Force detect, or only when not provided in HTTP response headers -->
  <forceContentTypeDetection>[false|true]</forceContentTypeDetection>
  <forceCharsetDetection>[false|true]</forceCharsetDetection>
  <referenceFilters>
    <!-- multiple "filter" tags allowed -->
    <filter class="(any reference filter class)">
      (Restrict usage of this fetcher to matching reference filters.
      Refer to the documentation for the IReferenceFilter implementation
      you are using here for usage details.)
    </filter>
  </referenceFilters>
  <!-- Comma-separated list of supported HTTP methods. -->
  <httpMethods>(defaults to: GET, HEAD)</httpMethods>
</fetcher>
XML usage example:
<fetcher class="GenericHttpFetcher">
  <authentication>
    <method>form</method>
    <credentials>
      <username>joeUser</username>
      <password>joePasword</password>
    </credentials>
    <formUsernameField>loginUser</formUsernameField>
    <formPasswordField>loginPwd</formPasswordField>
    <url>http://www.example.com/login/submit</url>
  </authentication>
</fetcher>
The above example authenticates the crawler to a website before crawling. The website uses an HTML form with username and password fields called "loginUser" and "loginPwd".
- Since:
- 3.0.0 (Merged from GenericDocumentFetcher and GenericHttpClientFactory)
- Author:
- Pascal Essiembre
-
Field Summary
Fields
Modifier and Type | Field | Description
static final String | AUTH_METHOD_BASIC | BASIC authentication method.
static final String | AUTH_METHOD_DIGEST | DIGEST authentication method.
static final String | AUTH_METHOD_FORM | Form-based authentication method.
static final String | AUTH_METHOD_KERBEROS | Experimental: Kerberos authentication method.
static final String | AUTH_METHOD_NTLM | NTLM authentication method.
static final String | AUTH_METHOD_SPNEGO | Experimental: SPNEGO authentication method.
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type | Method | Description
protected boolean | accept(HttpMethod httpMethod) | Whether the supplied HttpMethod is supported by this fetcher.
protected void | authenticateUsingForm(org.apache.http.client.HttpClient httpClient) |
protected void | buildCustomHttpClient(org.apache.http.impl.client.HttpClientBuilder builder) | For implementors to subclass.
protected org.apache.http.config.ConnectionConfig | createConnectionConfig() |
protected org.apache.http.client.CredentialsProvider | createCredentialsProvider() |
protected org.apache.http.client.CookieStore | createDefaultCookieStore() | Creates the default cookie store to be added to each request context.
protected List<org.apache.http.Header> | createDefaultRequestHeaders | Creates a list of HTTP headers based on configuration.
protected org.apache.http.client.HttpClient | createHttpClient() |
protected org.apache.http.HttpHost | createProxy() |
protected org.apache.http.client.config.RequestConfig | createRequestConfig() |
protected org.apache.http.conn.SchemePortResolver | createSchemePortResolver() |
protected SSLContext | createSSLContext |
protected org.apache.http.conn.socket.LayeredConnectionSocketFactory | createSSLSocketFactory(SSLContext sslContext) |
boolean | equals |
 | fetch(CrawlDoc doc, HttpMethod httpMethod) | Performs an HTTP request for the supplied document reference and HTTP method.
protected void | fetcherShutdown | Invoked once per fetcher when the collector ends.
protected void | fetcherStartup | Invoked once per fetcher instance, when the collector starts.
 | getConfig |
org.apache.http.client.HttpClient | getHttpClient() |
 | getUserAgent |
int | hashCode() |
void | loadHttpFetcherFromXML |
void | saveHttpFetcherToXML(XML xml) |
 | toString() |
Methods inherited from class com.norconex.collector.http.fetch.AbstractHttpFetcher
accept, accept, fetcherThreadBegin, fetcherThreadEnd, getReferenceFilters, loadFromXML, saveToXML, setReferenceFilters, setReferenceFilters
-
Field Details
-
AUTH_METHOD_FORM
Form-based authentication method.
-
AUTH_METHOD_BASIC
BASIC authentication method.
-
AUTH_METHOD_DIGEST
DIGEST authentication method.
-
AUTH_METHOD_NTLM
NTLM authentication method.
-
AUTH_METHOD_SPNEGO
Experimental: SPNEGO authentication method.
-
AUTH_METHOD_KERBEROS
Experimental: Kerberos authentication method.
-
-
Constructor Details
-
GenericHttpFetcher
public GenericHttpFetcher()
-
GenericHttpFetcher
-
-
Method Details
-
getConfig
-
getHttpClient
public org.apache.http.client.HttpClient getHttpClient()
-
accept
Description copied from class: AbstractHttpFetcher
Whether the supplied HttpMethod is supported by this fetcher.
- Specified by:
accept in class AbstractHttpFetcher
- Parameters:
httpMethod - the HTTP method
- Returns:
true if supported
-
fetcherStartup
Description copied from class: AbstractHttpFetcher
Invoked once per fetcher instance, when the collector starts. Default implementation does nothing.
- Overrides:
fetcherStartup in class AbstractHttpFetcher
- Parameters:
c - collector
-
fetcherShutdown
Description copied from class: AbstractHttpFetcher
Invoked once per fetcher when the collector ends. Default implementation does nothing.
- Overrides:
fetcherShutdown in class AbstractHttpFetcher
- Parameters:
c - collector
-
getUserAgent
-
fetch
Description copied from interface: IHttpFetcher
Performs an HTTP request for the supplied document reference and HTTP method.
For each HTTP method supported, implementors should do their best to populate the document and its CrawlDocInfo with as much information as they can.
Unsupported HTTP methods should return an HTTP response with the CrawlState.UNSUPPORTED state. To prevent users from having to configure multiple HTTP clients, implementors should try to support both the GET and HEAD methods. POST is only used in special cases and is often not used during a crawl session.
A null method is treated as a GET.
- Parameters:
doc - document to fetch or to use to make the request.
httpMethod - HTTP method
- Returns:
- an HTTP response
- Throws:
HttpFetchException - problem when fetching the document
-
createHttpClient
protected org.apache.http.client.HttpClient createHttpClient()
-
buildCustomHttpClient
protected void buildCustomHttpClient(org.apache.http.impl.client.HttpClientBuilder builder)
For implementors to subclass. Does nothing by default.
- Parameters:
builder - http client builder
-
authenticateUsingForm
protected void authenticateUsingForm(org.apache.http.client.HttpClient httpClient)
-
createDefaultCookieStore
protected org.apache.http.client.CookieStore createDefaultCookieStore()
Creates the default cookie store to be added to each request context.
- Returns:
- a cookie store
-
createDefaultRequestHeaders
Creates a list of HTTP headers based on configuration.
This method will also add a "Basic" authentication header if "preemptive" is true on the authentication configuration and credentials were supplied.
- Returns:
- a list of HTTP request headers
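For context on the preemptive "Basic" header mentioned above, the value of such a header is the word "Basic" followed by the Base64 encoding of "username:password". The sketch below shows that encoding only; the class PreemptiveBasicAuthSketch is hypothetical, and the use of UTF-8 is an assumption rather than a statement about this method's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch, for illustration only (not this class's own code).
final class PreemptiveBasicAuthSketch {

    // Builds the value of an "Authorization" request header for Basic auth.
    // Assumes UTF-8 encoding of the credentials.
    static String basicAuthorizationValue(String username, String password) {
        String pair = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }
}
```

Sending this header on every request avoids waiting for a 401 challenge first, but it also means the credentials accompany requests that may not need them.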
-
createSchemePortResolver
protected org.apache.http.conn.SchemePortResolver createSchemePortResolver()
-
createRequestConfig
protected org.apache.http.client.config.RequestConfig createRequestConfig()
-
createProxy
protected org.apache.http.HttpHost createProxy()
-
createCredentialsProvider
protected org.apache.http.client.CredentialsProvider createCredentialsProvider()
-
createConnectionConfig
protected org.apache.http.config.ConnectionConfig createConnectionConfig()
-
createSSLSocketFactory
protected org.apache.http.conn.socket.LayeredConnectionSocketFactory createSSLSocketFactory(SSLContext sslContext)
-
createSSLContext
-
loadHttpFetcherFromXML
- Specified by:
loadHttpFetcherFromXML in class AbstractHttpFetcher
-
saveHttpFetcherToXML
- Specified by:
saveHttpFetcherToXML in class AbstractHttpFetcher
-
equals
- Overrides:
equals in class AbstractHttpFetcher
-
hashCode
public int hashCode()
- Overrides:
hashCode in class AbstractHttpFetcher
-
toString
- Overrides:
toString in class AbstractHttpFetcher
-