Class WebDriverHttpFetcher

  • All Implemented Interfaces:
    IHttpFetcher, IEventListener<Event>, IXMLConfigurable, EventListener, Consumer<Event>

    public class WebDriverHttpFetcher
    extends AbstractHttpFetcher

    Uses Selenium WebDriver to drive native browsers when crawling documents. Useful for crawling JavaScript-driven websites.

    Considerations

    Relying on external software to fetch pages can be slower, less scalable, and less stable. Downloading binaries and non-HTML file formats may not always be possible. Prefer GenericHttpFetcher whenever possible. Use at your own risk.

    Supported HTTP method

    This fetcher only supports the HTTP GET method.

    HTTP Headers

    By default, web drivers do not expose HTTP headers from HTTP GET requests. If you want to capture them, configure the "httpSniffer". A proxy service will then be started to monitor HTTP traffic and store HTTP headers.

    NOTE: Capturing headers with a proxy may not be supported by all browsers or WebDriver implementations.
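
    As a rough illustration of the "httpSniffer" option, the following sketch enables the proxy on a random free port and adds one request header. The browser choice, the user agent string, and the header name and value are placeholders made up for this example; they are not defaults of this fetcher.

    ```xml
    <fetcher
        class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
      <browser>chrome</browser>
      <httpSniffer>
        <!-- 0 = random free port (the documented default). -->
        <port>0</port>
        <!-- Hypothetical user agent, for illustration only. -->
        <userAgent>MyCrawler/1.0</userAgent>
        <headers>
          <!-- Hypothetical header, for illustration only. -->
          <header name="X-Example-Header">example-value</header>
        </headers>
      </httpSniffer>
    </fetcher>
    ```

    Since header capture depends on the browser and web driver accepting the proxy, it is worth verifying on a small crawl that the expected headers actually show up.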

    XML configuration usage:

    
    <fetcher
        class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
      <browser>[chrome|edge|firefox|opera|safari]</browser>
      <!-- Local web driver settings -->
      <browserPath>(browser executable or blank to detect)</browserPath>
      <driverPath>(driver executable or blank to detect)</driverPath>
      <!-- Remote web driver setting -->
      <remoteURL>(URL of the remote web driver cluster)</remoteURL>
      <!-- Optional browser capabilities supported by the web driver. -->
      <capabilities>
        <capability
            name="(capability name)">
          (capability value)
        </capability>
        <!-- multiple "capability" tags allowed -->
      </capabilities>
      <!-- Optional browser arguments for web drivers supporting them. -->
      <arguments>
        <arg>(argument value)</arg>
        <!-- multiple "arg" tags allowed -->
      </arguments>
      <!-- Optionally take screenshots of each web page. -->
      <screenshot>
        <cssSelector>(Optional selector of element to capture.)</cssSelector>
        <targets>[metadata|directory] (One or both, separated by comma.)</targets>
        <imageFormat>(Image format. Default is "png".)</imageFormat>
        <!-- The following applies to the "directory" target: -->
        <targetDir
            field="(Document field to store the local path to the image.)"
            structure="[url2path|date|datetime]">
          (Local directory where to save images.)
        </targetDir>
        <!-- The following applies to the "metadata" target: -->
        <targetMetaField>
          (Document field where to store the image.)
        </targetMetaField>
      </screenshot>
      <windowSize>(Optional. Browser window dimensions. E.g., 640x480)</windowSize>
      <earlyPageScript>
        (Optional JavaScript code to be run the moment a page is requested.)
      </earlyPageScript>
      <latePageScript>
        (Optional JavaScript code to be run after we are done
         waiting for a page.)
      </latePageScript>
      <!--
        The following timeouts/waits are set in milliseconds or
        human-readable format (English). Default is zero (not set).
        -->
      <pageLoadTimeout>
        (Web driver max wait time for a page to load.)
      </pageLoadTimeout>
      <implicitlyWait>
        (Web driver max wait time for an element to appear. See
         "waitForElement".)
      </implicitlyWait>
      <scriptTimeout>
        (Web driver max wait time for scripts to execute.)
      </scriptTimeout>
      <waitForElement
          type="[tagName|className|cssSelector|id|linkText|name|partialLinkText|xpath]"
          selector="(Reference to element, as per the type specified.)">
        (Max wait time for an element to show up in browser before returning.
         Default 'type' is 'tagName'.)
      </waitForElement>
      <threadWait>
        (Makes the current thread sleep for the specified duration, to
        give the web driver enough time to load the page.
        Sometimes necessary for some web driver implementations if the above
        options do not work.)
      </threadWait>
      <referenceFilters>
        <!-- multiple "filter" tags allowed -->
        <filter
            class="(any reference filter class)">
          (Restrict usage of this fetcher to matching reference filters.
           Refer to the documentation for the IReferenceFilter implementation
           you are using here for usage details.)
        </filter>
      </referenceFilters>
      <!--
        Optionally set up an HTTP proxy that allows setting and capturing
        HTTP headers. For advanced use only. Not recommended
        for regular usage.
        -->
      <httpSniffer>
        <port>(default is 0 = random free port)</port>
        <host>(default is "localhost")</host>
        <userAgent>(optionally overwrite browser user agent)</userAgent>
        <maxBufferSize>
          (Maximum byte size before a request/response content is considered
           too large. Can be specified using notations, e.g., 25MB. Default is 10MB)
        </maxBufferSize>
        <!-- Optional HTTP request headers passed on every HTTP request -->
        <headers>
          <!-- You can repeat this header tag as needed. -->
          <header
              name="(header name)">
            (header value)
          </header>
        </headers>
        <!-- Optional chained proxy -->
        <chainedProxy/>
      </httpSniffer>
    </fetcher>
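
    To make the wait options in the template above more concrete, here is a minimal sketch that combines a page load timeout with an element wait. Every value is an assumption chosen for illustration: the durations (in milliseconds), the element id "content", the window size, and the headless argument, which is a Chrome command-line switch and not something this class defines.

    ```xml
    <fetcher
        class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
      <browser>chrome</browser>
      <arguments>
        <!-- Chrome headless switch; an assumption, only valid for browsers supporting it. -->
        <arg>--headless</arg>
      </arguments>
      <!-- Hypothetical window dimensions. -->
      <windowSize>1280x1024</windowSize>
      <!-- Give up loading a page after 30 seconds (value in milliseconds). -->
      <pageLoadTimeout>30000</pageLoadTimeout>
      <!-- Wait up to 5 seconds for a hypothetical element with id "content". -->
      <waitForElement type="id" selector="content">5000</waitForElement>
    </fetcher>
    ```

    If a given web driver still returns pages too early with settings like these, "threadWait" is the documented fallback.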

    XML usage example:

    
    <fetcher
        class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
      <browser>firefox</browser>
      <driverPath>/drivers/geckodriver.exe</driverPath>
      <referenceFilters>
        <filter
            class="ReferenceFilter">
          <valueMatcher
              method="regex">
            .*dynamic.*$
          </valueMatcher>
        </filter>
      </referenceFilters>
    </fetcher>

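    The above example uses Firefox, through a specific web driver, to crawl dynamically generated pages. Only references matching the regular expression in the reference filter are handled by this fetcher.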
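
    A variation on this example, sketched under stated assumptions, points the fetcher at a remote web driver instead of a local driver path and saves a screenshot of each page to a directory. The remote URL, the directory, and the field name are hypothetical values; adjust them to your own environment.

    ```xml
    <fetcher
        class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
      <browser>firefox</browser>
      <!-- Hypothetical address of a remote web driver cluster. -->
      <remoteURL>http://localhost:4444/wd/hub</remoteURL>
      <screenshot>
        <!-- Save images to disk only, not to a metadata field. -->
        <targets>directory</targets>
        <imageFormat>png</imageFormat>
        <!-- Hypothetical directory and field name. -->
        <targetDir field="screenshotPath" structure="datetime">./screenshots</targetDir>
      </screenshot>
    </fetcher>
    ```

    With the "directory" target, the document field named in "targetDir" holds the local path to the saved image, as described in the usage template.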

    Since:
    3.0.0
    Author:
    Pascal Essiembre
    • Constructor Detail

      • WebDriverHttpFetcher

        public WebDriverHttpFetcher()
        Creates a new WebDriver HTTP Fetcher defaulting to Firefox.
      • WebDriverHttpFetcher

        public WebDriverHttpFetcher(WebDriverHttpFetcherConfig config)
        Creates a new WebDriver HTTP Fetcher for the supplied configuration.
        Parameters:
        config - WebDriver configuration