Class GenericHttpFetcher

  • All Implemented Interfaces:
    IHttpFetcher, IEventListener<Event>, IXMLConfigurable, EventListener, Consumer<Event>

    public class GenericHttpFetcher
    extends AbstractHttpFetcher

    Default implementation of IHttpFetcher, based on Apache HttpClient.

    The "validStatusCodes" and "notFoundStatusCodes" configuration options expect a comma-separated list of HTTP response codes. If a code is added to both, the valid list takes precedence.
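
    The precedence rule above can be sketched as follows. This is a minimal illustration, not the fetcher's actual code: the class StatusCodeClassifier, its Outcome enum, and the BAD_STATUS name are hypothetical and exist only for this example.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not part of the Norconex API. */
class StatusCodeClassifier {

    enum Outcome { VALID, NOT_FOUND, BAD_STATUS }

    private final Set<Integer> validCodes;
    private final Set<Integer> notFoundCodes;

    StatusCodeClassifier(String validCsv, String notFoundCsv) {
        this.validCodes = parseCsv(validCsv);
        this.notFoundCodes = parseCsv(notFoundCsv);
    }

    // Turns "200, 203" into the set {200, 203}; blank entries are skipped.
    static Set<Integer> parseCsv(String csv) {
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(Integer::valueOf)
                .collect(Collectors.toSet());
    }

    Outcome classify(int statusCode) {
        // The valid list is checked first, so it wins when a code is in both.
        if (validCodes.contains(statusCode)) {
            return Outcome.VALID;
        }
        if (notFoundCodes.contains(statusCode)) {
            return Outcome.NOT_FOUND;
        }
        return Outcome.BAD_STATUS;
    }
}
```

    With "200, 410" as the valid list and "404, 410" as the not-found list, a 410 response is classified as valid in this sketch.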

    Accepted HTTP methods

    By default this fetcher accepts HTTP GET and HEAD requests. You can limit it to only process one method with GenericHttpFetcherConfig.setHttpMethods(List).

    Content type and character encoding

    The default way for the HTTP Collector to identify the content type and character encoding of a document is to rely on the "Content-Type" HTTP response header. Web servers can sometimes return invalid or missing content type and character encoding information. You can optionally decide not to trust web servers' HTTP responses and have the collector perform its own content type and encoding detection. Such detection can be enabled with GenericHttpFetcherConfig.setForceContentTypeDetection(boolean) and GenericHttpFetcherConfig.setForceCharsetDetection(boolean).
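
    To show why a server-supplied header may be insufficient, here is a minimal sketch of reading the media type and charset from a "Content-Type" value. The class ContentTypeHeader is hypothetical and is not how the collector is implemented; an empty result stands for the cases where detection would have to take over.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of the Norconex API. */
class ContentTypeHeader {

    // Returns the media type portion, e.g. "text/html", if present.
    static Optional<String> mediaType(String header) {
        if (header == null) {
            return Optional.empty();
        }
        String type = header.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return type.isEmpty() ? Optional.empty() : Optional.of(type);
    }

    // Returns the declared charset, or empty when missing or not usable.
    static Optional<Charset> charset(String header) {
        if (header == null) {
            return Optional.empty();
        }
        String[] parts = header.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length())
                        .replace("\"", "").trim();
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: treat as missing.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```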

    XML configuration entries expecting millisecond durations can be provided in human-readable format (English only), as per DurationParser (e.g., "5 minutes and 30 seconds" or "5m30s").
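
    As a rough illustration of what such a value amounts to in milliseconds, the sketch below converts the compact form only (e.g., "5m30s"). It is not DurationParser and does not reproduce its behavior or its supported syntax; the class CompactDuration and its accepted units (d, h, m, s) are assumptions made for this example.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the Norconex DurationParser. */
class CompactDuration {

    // One "<number><unit>" pair, with optional surrounding whitespace.
    private static final Pattern PART =
            Pattern.compile("\\s*(\\d+)\\s*([dhms])\\s*");

    static long toMillis(String text) {
        String s = text.trim().toLowerCase(Locale.ROOT);
        if (s.matches("\\d+")) {
            return Long.parseLong(s); // plain number: already milliseconds
        }
        Matcher m = PART.matcher(s);
        long total = 0;
        int pos = 0;
        // Consume consecutive pairs; stop at the first gap.
        while (m.find() && m.start() == pos) {
            total += Long.parseLong(m.group(1)) * unitMillis(m.group(2).charAt(0));
            pos = m.end();
        }
        if (pos == 0 || pos != s.length()) {
            throw new IllegalArgumentException("Unsupported duration: " + text);
        }
        return total;
    }

    private static long unitMillis(char unit) {
        switch (unit) {
            case 'd': return 86_400_000L;
            case 'h': return 3_600_000L;
            case 'm': return 60_000L;
            default:  return 1_000L; // 's'
        }
    }
}
```

    Under these assumptions, "5m30s" and "330000" both yield 330000 milliseconds.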

    HSTS Support

    Upon first encountering a secure site, this fetcher will check whether the site root domain declares "Strict-Transport-Security" (HSTS) policy support as part of its HTTP response headers. That information gets cached for future requests. If the site supports HSTS, any non-secure URLs encountered on the same domain will be automatically converted to "https" (including sub-domains if the HSTS policy says so).

    If you want to convert non-secure URLs to secure ones regardless of website HSTS support, use GenericURLNormalizer.Normalization.secureScheme instead. To disable HSTS support, use GenericHttpFetcherConfig.setDisableHSTS(boolean).
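
    The HSTS behavior described in this section can be sketched roughly as below. This is an illustration under stated assumptions, not the fetcher's implementation: the class HstsPolicy and its methods are hypothetical, caching is left out, and explicit port numbers in URLs are not handled.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the Norconex API. */
class HstsPolicy {

    private final String domain;
    private final boolean includeSubDomains;

    private HstsPolicy(String domain, boolean includeSubDomains) {
        this.domain = domain.toLowerCase(Locale.ROOT);
        this.includeSubDomains = includeSubDomains;
    }

    // Returns the policy declared by a "Strict-Transport-Security" header
    // value, or null if no policy is in effect.
    static HstsPolicy fromHeader(String domain, String headerValue) {
        if (headerValue == null) {
            return null;
        }
        boolean active = false;
        boolean subDomains = false;
        for (String directive : headerValue.split(";")) {
            String d = directive.trim().toLowerCase(Locale.ROOT);
            if (d.startsWith("max-age=")) {
                String seconds = d.substring("max-age=".length())
                        .replace("\"", "").trim();
                try {
                    // A max-age of 0 tells clients to forget the policy.
                    active = Long.parseLong(seconds) > 0;
                } catch (NumberFormatException e) {
                    return null; // malformed header: treat as unsupported
                }
            } else if (d.equals("includesubdomains")) {
                subDomains = true;
            }
        }
        return active ? new HstsPolicy(domain, subDomains) : null;
    }

    boolean appliesTo(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        return h.equals(domain)
                || (includeSubDomains && h.endsWith("." + domain));
    }

    // Rewrites "http" URLs covered by the policy to "https".
    String upgrade(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null
                || !"http".equalsIgnoreCase(uri.getScheme())
                || !appliesTo(host)) {
            return url;
        }
        return "https" + url.substring(uri.getScheme().length());
    }
}
```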

    Pro-active change detection

    This fetcher takes advantage of the ETag and If-Modified-Since HTTP specifications.

    On subsequent crawls, HTTP requests will include previously cached ETag and If-Modified-Since values to tell supporting servers we only want to download a document if it was modified since our last request. To disable support for pro-active change detection, you can use GenericHttpFetcherConfig.setDisableIfModifiedSince(boolean) and GenericHttpFetcherConfig.setDisableETag(boolean).

    These settings have no effect for web servers not supporting them.
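
    As a sketch of what such a conditional request carries, the example below builds the two standard HTTP request headers involved: "If-None-Match" (which transports a previously received ETag) and "If-Modified-Since". The class ConditionalHeaders and its parameters are hypothetical; the fetcher itself relies on Apache HttpClient and may assemble these headers differently.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the Norconex API. */
class ConditionalHeaders {

    // HTTP-date format (IMF-fixdate), always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
            .withZone(ZoneOffset.UTC);

    // Either argument may be null when no value was cached.
    static Map<String, String> build(String cachedETag, Instant lastCrawlDate) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (cachedETag != null) {
            headers.put("If-None-Match", cachedETag);
        }
        if (lastCrawlDate != null) {
            headers.put("If-Modified-Since", HTTP_DATE.format(lastCrawlDate));
        }
        return headers;
    }

    // A supporting server answers 304 (Not Modified) with no body.
    static boolean isUnmodified(int statusCode) {
        return statusCode == 304;
    }
}
```

    A server that ignores these headers simply returns the full document with its usual status code.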

    XML configuration usage:

    
    <fetcher
        class="com.norconex.collector.http.fetch.impl.GenericHttpFetcher">
      <userAgent>(identify yourself!)</userAgent>
      <cookieSpec>
        [STANDARD|DEFAULT|IGNORE_COOKIES|NETSCAPE|STANDARD_STRICT]
      </cookieSpec>
      <connectionTimeout>(milliseconds)</connectionTimeout>
      <socketTimeout>(milliseconds)</socketTimeout>
      <connectionRequestTimeout>(milliseconds)</connectionRequestTimeout>
      <connectionCharset>...</connectionCharset>
      <expectContinueEnabled>[false|true]</expectContinueEnabled>
      <maxRedirects>...</maxRedirects>
      <redirectURLProvider>(implementation handling redirects)</redirectURLProvider>
      <localAddress>...</localAddress>
      <maxConnections>...</maxConnections>
      <maxConnectionsPerRoute>...</maxConnectionsPerRoute>
      <maxConnectionIdleTime>(milliseconds)</maxConnectionIdleTime>
      <maxConnectionInactiveTime>(milliseconds)</maxConnectionInactiveTime>
      <!-- Be warned: trusting all certificates is usually a bad idea. -->
      <trustAllSSLCertificates>[false|true]</trustAllSSLCertificates>
      <!-- You can specify SSL/TLS protocols to use -->
      <sslProtocols>(comma-separated list)</sslProtocols>
      <!-- Disable Server Name Indication (SNI) -->
      <disableSNI>[false|true]</disableSNI>
      <!-- Disable support for website "Strict-Transport-Security" setting. -->
      <disableHSTS>[false|true]</disableHSTS>
      <!-- You can use a specific key store for SSL Certificates -->
      <keyStoreFile/>
      <proxySettings/>
      <!-- HTTP request header constants passed on every HTTP request -->
      <headers>
        <header
            name="(header name)">
          (header value)
        </header>
        <!-- You can repeat this header tag as needed. -->
      </headers>
      <!-- Disable conditionally getting a document based on last crawl date. -->
      <disableIfModifiedSince>[false|true]</disableIfModifiedSince>
      <!-- Disable ETag support. -->
      <disableETag>[false|true]</disableETag>
      <!-- Optional authentication details. -->
      <authentication>
        <method>[form|basic|digest|ntlm|spnego|kerberos]</method>
        <!-- These apply to any authentication mechanism -->
        <credentials/>
        <!-- These apply to FORM authentication -->
        <formUsernameField>...</formUsernameField>
        <formPasswordField>...</formPasswordField>
        <url>
          (Either a login form's action target URL or the URL of a page containing
           a login form if a "formSelector" is specified.)
        </url>
        <formCharset>...</formCharset>
        <!-- Extra form parameters required to authenticate (since 2.8.0) -->
        <formParams>
          <param
              name="(param name)">
            (param value)
          </param>
          <!-- You can repeat this param tag as needed. -->
        </formParams>
        <formSelector>
          (CSS selector identifying the login page. E.g., "form")
        </formSelector>
        <!-- These apply to both BASIC and DIGEST authentication -->
        <host/>
        <realm>...</realm>
        <!-- This applies to BASIC authentication -->
        <preemptive>[false|true]</preemptive>
        <!-- These apply to NTLM authentication -->
        <host/>
        <workstation>...</workstation>
        <domain>...</domain>
      </authentication>
      <validStatusCodes>(defaults to 200)</validStatusCodes>
      <notFoundStatusCodes>(defaults to 404)</notFoundStatusCodes>
      <headersPrefix>(string to prefix headers)</headersPrefix>
      <!-- Force detect, or only when not provided in HTTP response headers -->
      <forceContentTypeDetection>[false|true]</forceContentTypeDetection>
      <forceCharsetDetection>[false|true]</forceCharsetDetection>
      <referenceFilters>
        <!-- multiple "filter" tags allowed -->
        <filter
            class="(any reference filter class)">
          (Restrict usage of this fetcher to matching reference filters.
           Refer to the documentation for the IReferenceFilter implementation
           you are using here for usage details.)
        </filter>
      </referenceFilters>
      <!-- Comma-separated list of supported HTTP methods. -->
      <httpMethods>(defaults to: GET, HEAD)</httpMethods>
    </fetcher>

    XML usage example:

    
    <fetcher
        class="GenericHttpFetcher">
      <authentication>
        <method>form</method>
        <credentials>
          <username>joeUser</username>
          <password>joePasword</password>
        </credentials>
        <formUsernameField>loginUser</formUsernameField>
        <formPasswordField>loginPwd</formPasswordField>
        <url>http://www.example.com/login/submit</url>
      </authentication>
    </fetcher>

    The above example will authenticate the crawler to a website before crawling. The website uses an HTML form with username and password fields called "loginUser" and "loginPwd".
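
    For readers unfamiliar with form logins, the sketch below shows what a URL-encoded form submission body typically looks like for such fields. It is an assumption about the general shape of the request, not a description of how the fetcher builds it; the class LoginFormBody is hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not part of the Norconex API. */
class LoginFormBody {

    // Encodes fields as an application/x-www-form-urlencoded body.
    static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }
}
```

    Passing an ordered map (such as a LinkedHashMap) keeps the fields in the order they were added.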

    Since:
    3.0.0 (Merged from GenericDocumentFetcher and GenericHttpClientFactory)
    Author:
    Pascal Essiembre
    • Constructor Detail

      • GenericHttpFetcher

        public GenericHttpFetcher()
    • Method Detail

      • getHttpClient

        public org.apache.http.client.HttpClient getHttpClient()
      • accept

        protected boolean accept​(HttpMethod httpMethod)
        Description copied from class: AbstractHttpFetcher
        Whether the supplied HttpMethod is supported by this fetcher.
        Specified by:
        accept in class AbstractHttpFetcher
        Parameters:
        httpMethod - the HTTP method
        Returns:
        true if supported
      • getUserAgent

        public String getUserAgent()
      • fetch

        public IHttpFetchResponse fetch​(CrawlDoc doc,
                                        HttpMethod httpMethod)
                                 throws HttpFetchException
        Description copied from interface: IHttpFetcher

        Performs an HTTP request for the supplied document reference and HTTP method.

        For each HTTP method supported, implementors should do their best to populate the document and its CrawlDocInfo with as much information as they can.

        Unsupported HTTP methods should return an HTTP response with the CrawlState.UNSUPPORTED state. To spare users from having to configure multiple HTTP clients, implementors should try to support both the GET and HEAD methods. POST is only used in special cases and is often not used during a crawl session.

        A null method is treated as a GET.

        Parameters:
        doc - document to fetch or to use to make the request.
        httpMethod - HTTP method
        Returns:
        an HTTP response
        Throws:
        HttpFetchException - problem when fetching the document
        See Also:
        HttpFetchResponseBuilder.unsupported()
      • createHttpClient

        protected org.apache.http.client.HttpClient createHttpClient()
      • buildCustomHttpClient

        protected void buildCustomHttpClient​(org.apache.http.impl.client.HttpClientBuilder builder)
        For implementors to subclass. Does nothing by default.
        Parameters:
        builder - http client builder
      • authenticateUsingForm

        protected void authenticateUsingForm​(org.apache.http.client.HttpClient httpClient)
      • createDefaultCookieStore

        protected org.apache.http.client.CookieStore createDefaultCookieStore()
        Creates the default cookie store to be added to each request context.
        Returns:
        a cookie store
      • createDefaultRequestHeaders

        protected List<org.apache.http.Header> createDefaultRequestHeaders()

        Creates a list of HTTP headers based on configuration.

        This method will also add a "Basic" authentication header if "preemptive" is true on the authentication configuration and credentials were supplied.

        Returns:
        a list of HTTP request headers
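
        For context, a preemptive "Basic" header has the standard form "Basic " followed by the Base64 encoding of "username:password". The sketch below shows that encoding only; the class BasicAuthHeader is hypothetical, and the use of UTF-8 is an assumption rather than a statement about this method's internals.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper for illustration; not part of the Norconex API. */
class BasicAuthHeader {

    // Builds the value of an "Authorization" request header.
    static String value(String username, String password) {
        String pair = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }
}
```

        Because the credentials are only encoded, not encrypted, sending them preemptively is best reserved for secure (https) connections.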
      • createSchemePortResolver

        protected org.apache.http.conn.SchemePortResolver createSchemePortResolver()
      • createRequestConfig

        protected org.apache.http.client.config.RequestConfig createRequestConfig()
      • createProxy

        protected org.apache.http.HttpHost createProxy()
      • createCredentialsProvider

        protected org.apache.http.client.CredentialsProvider createCredentialsProvider()
      • createConnectionConfig

        protected org.apache.http.config.ConnectionConfig createConnectionConfig()
      • createSSLSocketFactory

        protected org.apache.http.conn.socket.LayeredConnectionSocketFactory createSSLSocketFactory​(SSLContext sslContext)
      • createSSLContext

        protected SSLContext createSSLContext()