Class ApacheHttpUtil

java.lang.Object
com.norconex.collector.http.fetch.util.ApacheHttpUtil

public final class ApacheHttpUtil extends Object
Utility methods for fetcher implementations using Apache HttpClient.
Since:
3.0.0
Author:
Pascal Essiembre
  • Method Details

    • applyResponseContent

      public static boolean applyResponseContent(org.apache.http.HttpResponse response, CrawlDoc doc) throws IOException

      Applies the HTTP response content to a document if such content exists. The response stream is fully downloaded and associated with the document.

      Parameters:
      response - the HTTP response
      doc - document to apply content to
      Returns:
      true if there was content to apply
      Throws:
      IOException - could not read existing content
    • applyResponseHeaders

      public static void applyResponseHeaders(org.apache.http.HttpResponse response, String prefix, CrawlDoc doc)

      Applies the HTTP response headers to a document. This method makes a best effort to derive the following information from the HTTP headers and set it on the document's HttpDocInfo:

      • Content type
      • Content encoding
      • ETag

      In addition, all HTTP headers will be added to the document metadata, with an optional prefix.

      Parameters:
      response - the HTTP response
      prefix - optional metadata prefix for all HTTP response headers
      doc - document to apply headers on
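
      The prefixing behavior described above can be sketched with plain JDK collections. This is only an illustration under stated assumptions: the class HeaderPrefixSketch, the method addHeaders, and the prefix "collector.http." are hypothetical, and the maps stand in for the response headers and the document metadata rather than for the actual Apache or Norconex types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the ApacheHttpUtil implementation.
final class HeaderPrefixSketch {

    // Copies every header into the metadata map, prepending the prefix when one is given.
    static void addHeaders(Map<String, List<String>> headers, String prefix,
            Map<String, List<String>> metadata) {
        String p = (prefix == null) ? "" : prefix; // the prefix is optional
        for (Map.Entry<String, List<String>> header : headers.entrySet()) {
            metadata.computeIfAbsent(p + header.getKey(), k -> new ArrayList<>())
                    .addAll(header.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers = new LinkedHashMap<>();
        headers.put("Content-Type", List.of("text/html; charset=UTF-8"));
        headers.put("ETag", List.of("\"abc123\""));

        Map<String, List<String>> metadata = new LinkedHashMap<>();
        addHeaders(headers, "collector.http.", metadata); // assumed prefix, for illustration
        System.out.println(metadata.keySet());
    }
}
```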
    • applyContentTypeAndCharset

      public static void applyContentTypeAndCharset(String value, CrawlDocInfo docInfo)
      Applies the Content-Type HTTP response header to the supplied document info. It extracts both the content type and the charset from the value and sets them by invoking DocInfo.setContentType(ContentType) and DocInfo.setContentEncoding(String). This method is automatically invoked by applyResponseHeaders(HttpResponse, String, CrawlDoc) when it encounters a content type header.
      Parameters:
      value - value to parse and set.
      docInfo - document info
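
      As a rough illustration of splitting a Content-Type value into its content type and charset, the sketch below uses only the JDK. The class ContentTypeSketch and its parse method are hypothetical; the actual method may parse the header differently and stores the results on the document info instead of returning them.

```java
import java.util.Locale;

// Hypothetical sketch; not the ApacheHttpUtil implementation.
final class ContentTypeSketch {

    // Returns {mediaType, charset}; charset is null when the value carries none.
    static String[] parse(String value) {
        String[] parts = value.split(";");
        String mediaType = parts[0].trim().toLowerCase(Locale.ROOT);
        String charset = null;
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            // Parameter names are case-insensitive, so match "charset=" ignoring case.
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                charset = param.substring(8).trim().replace("\"", "");
            }
        }
        return new String[] { mediaType, charset };
    }

    public static void main(String[] args) {
        String[] parsed = parse("text/html; charset=UTF-8");
        System.out.println(parsed[0] + " / " + parsed[1]); // prints: text/html / UTF-8
    }
}
```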
    • setRequestIfModifiedSince

      public static void setRequestIfModifiedSince(org.apache.http.HttpRequest request, CrawlDoc doc)
      Sets the If-Modified-Since HTTP request header based on the document's cached last-crawled date (if any).
      Parameters:
      request - HTTP request
      doc - document
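
      An If-Modified-Since value is an HTTP date expressed in GMT. The sketch below shows one way to format a cached timestamp that way with java.time. The class IfModifiedSinceSketch and the method headerValue are hypothetical, and how the actual method reads the cached date from the document is not shown here.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical sketch; not the ApacheHttpUtil implementation.
final class IfModifiedSinceSketch {

    // HTTP dates use a fixed-length, English, GMT-based format.
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    // Returns the header value, or null when there is no cached date to send.
    static String headerValue(Instant lastCrawled) {
        return (lastCrawled == null) ? null : HTTP_DATE.format(lastCrawled);
    }

    public static void main(String[] args) {
        Instant cached = Instant.parse("2015-10-21T07:28:00Z");
        System.out.println("If-Modified-Since: " + headerValue(cached));
    }
}
```

      Returning null when no date is cached mirrors the "(if any)" wording: the caller would simply leave the header off.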
    • setRequestIfNoneMatch

      public static void setRequestIfNoneMatch(org.apache.http.HttpRequest request, CrawlDoc doc)
      Sets the If-None-Match HTTP request header based on the document's cached ETag value (if any).
      Parameters:
      request - HTTP request
      doc - document
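
      The conditional-request idea can be sketched as follows. Note the assumptions: this uses the JDK's java.net.http.HttpRequest rather than the Apache HttpRequest this class works with, and the class IfNoneMatchSketch, the method conditionalGet, and the example URL are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch using the JDK HTTP client types, not Apache HttpClient.
final class IfNoneMatchSketch {

    // Builds a GET request, adding If-None-Match only when an ETag was cached.
    static HttpRequest conditionalGet(String url, String cachedETag) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        if (cachedETag != null && !cachedETag.isBlank()) {
            builder.header("If-None-Match", cachedETag);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        HttpRequest request = conditionalGet("https://example.com/page", "\"abc123\"");
        System.out.println(request.headers().firstValue("If-None-Match").orElse("(none)"));
    }
}
```

      A server that still holds the same ETag can then answer 304 Not Modified instead of resending the body.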
    • createUriRequest

      public static org.apache.http.client.methods.HttpRequestBase createUriRequest(String url, String method)
      Creates an HTTP request.
      Parameters:
      url - the request target URL
      method - HTTP method (defaults to GET if null)
      Returns:
      Apache HTTP request
    • createUriRequest

      public static org.apache.http.client.methods.HttpRequestBase createUriRequest(String url, HttpMethod method)
      Creates an HTTP request.
      Parameters:
      url - the request target URL
      method - HTTP method (defaults to GET if null)
      Returns:
      Apache HTTP request
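
      The "defaults to GET if null" rule shared by both overloads can be illustrated with a small sketch. It relies on assumptions: it builds a JDK java.net.http.HttpRequest instead of an Apache HttpRequestBase, and the class CreateRequestSketch, the method create, and the example URL are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch using the JDK HTTP client types, not Apache HttpClient.
final class CreateRequestSketch {

    // Creates a body-less request, falling back to GET when no method is given.
    static HttpRequest create(String url, String method) {
        String name = (method == null) ? "GET" : method;
        return HttpRequest.newBuilder(URI.create(url))
                .method(name, HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        System.out.println(create("https://example.com/", null).method());
        System.out.println(create("https://example.com/", "HEAD").method());
    }
}
```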
    • authenticateUsingForm

      public static void authenticateUsingForm(org.apache.http.client.HttpClient httpClient, HttpAuthConfig authConfig) throws IOException, URISyntaxException
      Throws:
      IOException
      URISyntaxException