Class GenericHttpFetcher

java.lang.Object
com.norconex.collector.http.fetch.AbstractHttpFetcher
com.norconex.collector.http.fetch.impl.GenericHttpFetcher
All Implemented Interfaces:
IHttpFetcher, IEventListener<Event>, IXMLConfigurable, EventListener, Consumer<Event>

public class GenericHttpFetcher extends AbstractHttpFetcher

Default implementation of IHttpFetcher, based on Apache HttpClient.

The "validStatusCodes" and "notFoundStatusCodes" configuration options expect a comma-separated list of HTTP response codes. If a code is added to both, the valid list takes precedence.
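The precedence rule above can be pictured with a small stand-alone sketch. The class, enum and method names below are hypothetical and are not part of the fetcher API; the sketch only assumes that the valid list is consulted before the not-found list.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class StatusCodeSketch {

    // OTHER stands for any code present in neither list (assumed name).
    enum Outcome { VALID, NOT_FOUND, OTHER }

    // Parses a comma-separated list such as "200, 304" into a set of codes.
    static Set<Integer> parseCodes(String csv) {
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(Integer::valueOf)
                .collect(Collectors.toSet());
    }

    // The valid list is checked first, so it wins when a code is in both.
    static Outcome classify(int code, Set<Integer> valid, Set<Integer> notFound) {
        if (valid.contains(code)) {
            return Outcome.VALID;
        }
        if (notFound.contains(code)) {
            return Outcome.NOT_FOUND;
        }
        return Outcome.OTHER;
    }
}
```

With "200, 404" as valid codes and "404, 410" as not-found codes, this sketch classifies 404 as valid and 410 as not found.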

Accepted HTTP methods

By default, this fetcher accepts HTTP GET and HEAD requests. You can restrict it to a single method with GenericHttpFetcherConfig.setHttpMethods(List).

Content type and character encoding

By default, the HTTP Collector identifies the content type and character encoding of a document from the "Content-Type" HTTP response header. Web servers sometimes return invalid or missing content type and character encoding information. You can optionally decide not to trust web servers' HTTP responses and have the collector perform its own content type and encoding detection. Such detection can be enabled with GenericHttpFetcherConfig.setForceContentTypeDetection(boolean) and GenericHttpFetcherConfig.setForceCharsetDetection(boolean).
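To illustrate why header-based identification can fall short, here is a minimal sketch of reading the media type and charset from a "Content-Type" value. The class and method names are hypothetical and this is not the collector's actual parsing code. An empty result stands for the case where a detection step would be needed.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class ContentTypeSketch {

    // Returns the media type without parameters, e.g. "text/html".
    static Optional<String> mediaType(String header) {
        if (header == null) {
            return Optional.empty();
        }
        String type = header.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return type.isEmpty() ? Optional.empty() : Optional.of(type);
    }

    // Returns the charset parameter, or empty if missing or not usable.
    static Optional<Charset> charset(String header) {
        if (header == null) {
            return Optional.empty();
        }
        String[] parts = header.split(";");
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=", 2);
            if (kv.length == 2 && kv[0].trim().equalsIgnoreCase("charset")) {
                String name = kv[1].trim().replace("\"", "");
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name sent by the server.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

A header such as "text/html" without a charset parameter, or one naming an unsupported charset, yields an empty charset in this sketch.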

XML configuration entries expecting millisecond durations can be provided in human-readable format (English only), as per DurationParser (e.g., "5 minutes and 30 seconds" or "5m30s").
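As a rough illustration of the compact duration form, the sketch below converts values such as "5m30s" to milliseconds. It is a simplification written for this page, not the DurationParser implementation: it only assumes the units h, m, s and ms, accepts plain millisecond numbers, and does not handle the spelled-out English form.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the DurationParser class.
final class DurationSketch {

    private static final Pattern PART =
            Pattern.compile("(\\d+)\\s*(ms|h|m|s)");

    static long toMillis(String text) {
        String value = text.trim().toLowerCase(Locale.ROOT);
        // A bare number is taken as milliseconds.
        if (value.matches("\\d+")) {
            return Long.parseLong(value);
        }
        // Reject anything that is not a sequence of <number><unit> parts.
        if (!value.matches("(\\d+\\s*(ms|h|m|s)\\s*)+")) {
            throw new IllegalArgumentException("Unsupported duration: " + text);
        }
        long total = 0;
        Matcher m = PART.matcher(value);
        while (m.find()) {
            long n = Long.parseLong(m.group(1));
            switch (m.group(2)) {
                case "h": total += n * 3_600_000L; break;
                case "m": total += n * 60_000L; break;
                case "s": total += n * 1_000L; break;
                default:  total += n; // "ms"
            }
        }
        return total;
    }
}
```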

HSTS Support

Upon first encountering a secure site, this fetcher checks whether the site root domain includes "Strict-Transport-Security" (HSTS) policy support in its HTTP response headers. That information is cached for future requests. If the site supports HSTS, any non-secure URLs encountered on the same domain are automatically converted to "https" (including sub-domains if the HSTS policy says so).

If you want to convert non-secure URLs to secure ones regardless of website HSTS support, use GenericURLNormalizer.Normalization.secureScheme instead. To disable HSTS support, use GenericHttpFetcherConfig.setDisableHSTS(boolean).
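The HSTS behavior described in this section can be sketched as follows. This is a simplified illustration with hypothetical names, not the fetcher's internal code. It assumes the header follows the usual "max-age=...; includeSubDomains" form and it does not deal with explicit port numbers.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class HstsSketch {

    final long maxAgeSeconds;
    final boolean includeSubDomains;

    HstsSketch(long maxAgeSeconds, boolean includeSubDomains) {
        this.maxAgeSeconds = maxAgeSeconds;
        this.includeSubDomains = includeSubDomains;
    }

    // Parses a "Strict-Transport-Security" header value.
    static HstsSketch parse(String headerValue) {
        long maxAge = 0;
        boolean sub = false;
        for (String directive : headerValue.split(";")) {
            String d = directive.trim().toLowerCase(Locale.ROOT);
            if (d.equals("includesubdomains")) {
                sub = true;
            } else if (d.startsWith("max-age=")) {
                String raw = d.substring("max-age=".length());
                maxAge = Long.parseLong(raw.replace("\"", "").trim());
            }
        }
        return new HstsSketch(maxAge, sub);
    }

    // Whether a policy learned for policyHost covers the given host.
    boolean appliesTo(String policyHost, String host) {
        if (maxAgeSeconds <= 0) {
            return false; // max-age=0 means the site withdrew its policy
        }
        String p = policyHost.toLowerCase(Locale.ROOT);
        String h = host.toLowerCase(Locale.ROOT);
        return h.equals(p) || (includeSubDomains && h.endsWith("." + p));
    }

    // Rewrites an "http://" URL to "https://"; other URLs are unchanged.
    static String toHttps(String url) {
        return url.regionMatches(true, 0, "http://", 0, 7)
                ? "https://" + url.substring(7)
                : url;
    }
}
```

In this sketch, a policy of "max-age=31536000; includeSubDomains" cached for example.com would also cover www.example.com, so http://www.example.com/page would be requested as https://www.example.com/page.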

Pro-active change detection

This fetcher takes advantage of the ETag and If-Modified-Since HTTP specifications.

On subsequent crawls, HTTP requests will include previously cached ETag and If-Modified-Since values to tell supporting servers we only want to download a document if it was modified since our last request. To disable support for pro-active change detection, you can use GenericHttpFetcherConfig.setDisableIfModifiedSince(boolean) and GenericHttpFetcherConfig.setDisableETag(boolean).

These settings have no effect for web servers not supporting them.
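As an illustration of how such conditional requests are typically formed, the sketch below builds the request headers from cached values. The class and method names are hypothetical, and the sketch assumes the standard HTTP header names: a cached ETag is sent back in "If-None-Match", the last crawl date in "If-Modified-Since", and a supporting server answers 304 when the document is unchanged.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class ConditionalRequestSketch {

    // HTTP date format, e.g. "Fri, 05 Jan 2024 10:15:30 GMT".
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    // Builds conditional headers from values cached on a previous crawl.
    static Map<String, String> conditionalHeaders(
            String cachedETag, Instant lastCrawl,
            boolean disableETag, boolean disableIfModifiedSince) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (!disableETag && cachedETag != null) {
            headers.put("If-None-Match", cachedETag);
        }
        if (!disableIfModifiedSince && lastCrawl != null) {
            headers.put("If-Modified-Since", HTTP_DATE.format(lastCrawl));
        }
        return headers;
    }

    // 304 Not Modified: the server confirms there is nothing new to download.
    static boolean isUnmodified(int statusCode) {
        return statusCode == 304;
    }
}
```

A server that ignores these headers simply returns the full document with its usual status code, which is why the settings have no effect there.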

XML configuration usage:


<fetcher
    class="com.norconex.collector.http.fetch.impl.GenericHttpFetcher">
  <userAgent>(identify yourself!)</userAgent>
  <cookieSpec>
    [STANDARD|DEFAULT|IGNORE_COOKIES|NETSCAPE|STANDARD_STRICT]
  </cookieSpec>
  <connectionTimeout>(milliseconds)</connectionTimeout>
  <socketTimeout>(milliseconds)</socketTimeout>
  <connectionRequestTimeout>(milliseconds)</connectionRequestTimeout>
  <connectionCharset>...</connectionCharset>
  <expectContinueEnabled>[false|true]</expectContinueEnabled>
  <maxRedirects>...</maxRedirects>
  <redirectURLProvider>(implementation handling redirects)</redirectURLProvider>
  <localAddress>...</localAddress>
  <maxConnections>...</maxConnections>
  <maxConnectionsPerRoute>...</maxConnectionsPerRoute>
  <maxConnectionIdleTime>(milliseconds)</maxConnectionIdleTime>
  <maxConnectionInactiveTime>(milliseconds)</maxConnectionInactiveTime>
  <!-- Be warned: trusting all certificates is usually a bad idea. -->
  <trustAllSSLCertificates>[false|true]</trustAllSSLCertificates>
  <!-- You can specify SSL/TLS protocols to use -->
  <sslProtocols>(comma-separated list)</sslProtocols>
  <!-- Disable Server Name Indication (SNI) -->
  <disableSNI>[false|true]</disableSNI>
  <!-- Disable support for website "Strict-Transport-Security" setting. -->
  <disableHSTS>[false|true]</disableHSTS>
  <!-- You can use a specific key store for SSL Certificates -->
  <keyStoreFile/>
  <proxySettings/>
  <!-- HTTP request header constants passed on every HTTP request -->
  <headers>
    <header
        name="(header name)">
      (header value)
    </header>
    <!-- You can repeat this header tag as needed. -->
  </headers>
  <!-- Disable conditionally getting a document based on last crawl date. -->
  <disableIfModifiedSince>[false|true]</disableIfModifiedSince>
  <!-- Disable ETag support. -->
  <disableETag>[false|true]</disableETag>
  <!-- Optional authentication details. -->
  <authentication>
    <method>[form|basic|digest|ntlm|spnego|kerberos]</method>
    <!-- These apply to any authentication mechanism -->
    <credentials/>
    <!-- These apply to FORM authentication -->
    <formUsernameField>...</formUsernameField>
    <formPasswordField>...</formPasswordField>
    <url>
      (Either a login form's action target URL or the URL of a page containing
       a login form if a "formSelector" is specified.)
    </url>
    <formCharset>...</formCharset>
    <!-- Extra form parameters required to authenticate (since 2.8.0) -->
    <formParams>
      <param
          name="(param name)">
        (param value)
      </param>
      <!-- You can repeat this param tag as needed. -->
    </formParams>
    <formSelector>
      (CSS selector identifying the login page. E.g., "form")
    </formSelector>
    <!-- These apply to both BASIC and DIGEST authentication -->
    <host/>
    <realm>...</realm>
    <!-- This applies to BASIC authentication -->
    <preemptive>[false|true]</preemptive>
    <!-- These apply to NTLM authentication -->
    <host/>
    <workstation>...</workstation>
    <domain>...</domain>
  </authentication>
  <validStatusCodes>(defaults to 200)</validStatusCodes>
  <notFoundStatusCodes>(defaults to 404)</notFoundStatusCodes>
  <headersPrefix>(string to prefix headers)</headersPrefix>
  <!-- Force detect, or only when not provided in HTTP response headers -->
  <forceContentTypeDetection>[false|true]</forceContentTypeDetection>
  <forceCharsetDetection>[false|true]</forceCharsetDetection>
  <referenceFilters>
    <!-- multiple "filter" tags allowed -->
    <filter
        class="(any reference filter class)">
      (Restrict usage of this fetcher to matching reference filters.
       Refer to the documentation for the IReferenceFilter implementation
       you are using here for usage details.)
    </filter>
  </referenceFilters>
  <!-- Comma-separated list of supported HTTP methods. -->
  <httpMethods>(defaults to: GET, HEAD)</httpMethods>
</fetcher>

XML usage example:


<fetcher
    class="GenericHttpFetcher">
  <authentication>
    <method>form</method>
    <credentials>
      <username>joeUser</username>
      <password>joePasword</password>
    </credentials>
    <formUsernameField>loginUser</formUsernameField>
    <formPasswordField>loginPwd</formPasswordField>
    <url>http://www.example.com/login/submit</url>
  </authentication>
</fetcher>

The above example authenticates the crawler to a website before crawling. The website uses an HTML form with username and password fields called "loginUser" and "loginPwd".

Since:
3.0.0 (Merged from GenericDocumentFetcher and GenericHttpClientFactory)
Author:
Pascal Essiembre
  • Constructor Details

    • GenericHttpFetcher

      public GenericHttpFetcher()
    • GenericHttpFetcher

      public GenericHttpFetcher(GenericHttpFetcherConfig httpFetcherConfig)
  • Method Details

    • getConfig

      public GenericHttpFetcherConfig getConfig()
    • getHttpClient

      public org.apache.http.client.HttpClient getHttpClient()
    • accept

      protected boolean accept(HttpMethod httpMethod)
      Description copied from class: AbstractHttpFetcher
      Whether the supplied HttpMethod is supported by this fetcher.
      Specified by:
      accept in class AbstractHttpFetcher
      Parameters:
      httpMethod - the HTTP method
      Returns:
      true if supported
    • fetcherStartup

      protected void fetcherStartup(HttpCollector c)
      Description copied from class: AbstractHttpFetcher
      Invoked once per fetcher instance, when the collector starts. Default implementation does nothing.
      Overrides:
      fetcherStartup in class AbstractHttpFetcher
      Parameters:
      c - collector
    • fetcherShutdown

      protected void fetcherShutdown(HttpCollector c)
      Description copied from class: AbstractHttpFetcher
      Invoked once per fetcher when the collector ends. Default implementation does nothing.
      Overrides:
      fetcherShutdown in class AbstractHttpFetcher
      Parameters:
      c - collector
    • getUserAgent

      public String getUserAgent()
    • fetch

      public IHttpFetchResponse fetch(CrawlDoc doc, HttpMethod httpMethod) throws HttpFetchException
      Description copied from interface: IHttpFetcher

      Performs an HTTP request for the supplied document reference and HTTP method.

      For each HTTP method supported, implementors should do their best to populate the document and its CrawlDocInfo with as much information as they can.

      Unsupported HTTP methods should return an HTTP response with the CrawlState.UNSUPPORTED state. To spare users from having to configure multiple HTTP clients, implementors should try to support both the GET and HEAD methods. POST is only used in special cases and is often not used during a crawl session.

      A null method is treated as a GET.

      Parameters:
      doc - document to fetch or to use to make the request.
      httpMethod - HTTP method
      Returns:
      an HTTP response
      Throws:
      HttpFetchException - problem when fetching the document
    • createHttpClient

      protected org.apache.http.client.HttpClient createHttpClient()
    • buildCustomHttpClient

      protected void buildCustomHttpClient(org.apache.http.impl.client.HttpClientBuilder builder)
      For implementors to subclass. Does nothing by default.
      Parameters:
      builder - http client builder
    • authenticateUsingForm

      protected void authenticateUsingForm(org.apache.http.client.HttpClient httpClient)
    • createDefaultCookieStore

      protected org.apache.http.client.CookieStore createDefaultCookieStore()
      Creates the default cookie store to be added to each request context.
      Returns:
      a cookie store
    • createDefaultRequestHeaders

      protected List<org.apache.http.Header> createDefaultRequestHeaders()

      Creates a list of HTTP headers based on configuration.

      This method will also add a "Basic" authentication header if "preemptive" is true on the authentication configuration and credentials were supplied.

      Returns:
      a list of HTTP request headers
    • createSchemePortResolver

      protected org.apache.http.conn.SchemePortResolver createSchemePortResolver()
    • createRequestConfig

      protected org.apache.http.client.config.RequestConfig createRequestConfig()
    • createProxy

      protected org.apache.http.HttpHost createProxy()
    • createCredentialsProvider

      protected org.apache.http.client.CredentialsProvider createCredentialsProvider()
    • createConnectionConfig

      protected org.apache.http.config.ConnectionConfig createConnectionConfig()
    • createSSLSocketFactory

      protected org.apache.http.conn.socket.LayeredConnectionSocketFactory createSSLSocketFactory(SSLContext sslContext)
    • createSSLContext

      protected SSLContext createSSLContext()
    • loadHttpFetcherFromXML

      public void loadHttpFetcherFromXML(XML xml)
      Specified by:
      loadHttpFetcherFromXML in class AbstractHttpFetcher
    • saveHttpFetcherToXML

      public void saveHttpFetcherToXML(XML xml)
      Specified by:
      saveHttpFetcherToXML in class AbstractHttpFetcher
    • equals

      public boolean equals(Object other)
      Overrides:
      equals in class AbstractHttpFetcher
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class AbstractHttpFetcher
    • toString

      public String toString()
      Overrides:
      toString in class AbstractHttpFetcher
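
The preemptive "Basic" authentication header mentioned under createDefaultRequestHeaders can be illustrated with the sketch below. The class and method names are hypothetical, and the use of UTF-8 to encode the credentials is an assumption made for this example rather than a statement about the fetcher's behavior.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class BasicAuthSketch {

    // Builds the value of an "Authorization" header for Basic authentication:
    // the word "Basic" followed by base64("username:password").
    static String basicAuthorizationValue(String username, String password) {
        String pair = username + ":" + password;
        byte[] bytes = pair.getBytes(StandardCharsets.UTF_8); // assumed charset
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }
}
```

Sending this header up front saves a round trip, since the client does not wait for the server's 401 challenge. It also means the credentials accompany every request made with these headers, so it is best reserved for sites reached over HTTPS.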