Class WebDriverHttpFetcher

java.lang.Object
com.norconex.collector.http.fetch.AbstractHttpFetcher
com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher
All Implemented Interfaces:
IHttpFetcher, IEventListener<Event>, IXMLConfigurable, EventListener, Consumer<Event>

public class WebDriverHttpFetcher extends AbstractHttpFetcher

Uses Selenium WebDriver to drive native browsers when crawling documents. Useful for crawling JavaScript-driven websites.

Considerations

Relying on external software to fetch pages can be slower, less scalable, and less stable than direct HTTP fetching. Downloading binaries and non-HTML file formats may not always be possible. The use of GenericHttpFetcher should be preferred whenever possible. Use at your own risk.

Supported HTTP method

This fetcher only supports the HTTP GET method.

HTTP Headers

By default, web drivers do not expose HTTP headers from HTTP GET requests. To capture them, configure the "httpSniffer". A proxy service is then started to monitor HTTP traffic and store HTTP headers.

NOTE: Capturing headers with a proxy may not be supported by all browser/WebDriver implementations.
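
As a sketch, using only elements from the configuration reference below, a minimal setup that captures HTTP headers via the sniffer proxy could look like this (the browser choice is illustrative):

```xml
<fetcher
    class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
  <browser>chrome</browser>
  <!-- Starts a local proxy so HTTP headers get recorded. -->
  <httpSniffer>
    <port>0</port>            <!-- 0 = random free port -->
    <host>localhost</host>
  </httpSniffer>
</fetcher>
```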

XML configuration usage:


<fetcher
    class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
  <browser>[chrome|edge|firefox|opera|safari]</browser>
  <!-- Local web driver settings -->
  <browserPath>(browser executable or blank to detect)</browserPath>
  <driverPath>(driver executable or blank to detect)</driverPath>
  <!-- Remote web driver setting -->
  <remoteURL>(URL of the remote web driver cluster)</remoteURL>
  <!-- Optional browser capabilities supported by the web driver. -->
  <capabilities>
    <capability
        name="(capability name)">
      (capability value)
    </capability>
    <!-- multiple "capability" tags allowed -->
  </capabilities>
  <!-- Optional browser arguments for web drivers supporting them. -->
  <arguments>
    <arg>(argument value)</arg>
    <!-- multiple "arg" tags allowed -->
  </arguments>
  <!-- Optionally take screenshots of each web pages. -->
  <screenshot>
    <cssSelector>(Optional selector of element to capture.)</cssSelector>
    <targets>[metadata|directory] (One or both, separated by comma.)</targets>
    <imageFormat>(Image format. Default is "png".)</imageFormat>
    <!-- The following applies to the "directory" target: -->
    <targetDir
        field="(Document field to store the local path to the image.)"
        structure="[url2path|date|datetime]">
      (Local directory where to save images.)
    </targetDir>
    <!-- The following applies to the "metadata" target: -->
    <targetMetaField>
      (Document field where to store the image.)
    </targetMetaField>
  </screenshot>
  <windowSize>(Optional. Browser window dimensions. E.g., 640x480)</windowSize>
  <earlyPageScript>
    (Optional JavaScript code to be run the moment a page is requested.)
  </earlyPageScript>
  <latePageScript>
    (Optional JavaScript code to be run after we are done
     waiting for a page.)
  </latePageScript>
  <!--
    The following timeouts/waits are set in milliseconds or in a
    human-readable format (English). Default is zero (not set).
    -->
  <pageLoadTimeout>
    (Web driver max wait time for a page to load.)
  </pageLoadTimeout>
  <implicitlyWait>
    (Web driver max wait time for an element to appear. See
     "waitForElement".)
  </implicitlyWait>
  <scriptTimeout>
    (Web driver max wait time for scripts to execute.)
  </scriptTimeout>
  <waitForElement
      type="[tagName|className|cssSelector|id|linkText|name|partialLinkText|xpath]"
      selector="(Reference to element, as per the type specified.)">
    (Max wait time for an element to show up in browser before returning.
     Default 'type' is 'tagName'.)
  </waitForElement>
  <threadWait>
    (Makes the current thread sleep for the specified duration, to
    give the web driver enough time to load the page.
    Sometimes necessary for some web driver implementations if the above
    options do not work.)
  </threadWait>
  <referenceFilters>
    <!-- multiple "filter" tags allowed -->
    <filter
        class="(any reference filter class)">
      (Restrict usage of this fetcher to matching reference filters.
       Refer to the documentation for the IReferenceFilter implementation
       you are using here for usage details.)
    </filter>
  </referenceFilters>
  <!--
    Optionally set up an HTTP proxy that allows setting and capturing
    HTTP headers. For advanced use only. Not recommended for
    regular usage.
    -->
  <httpSniffer>
    <port>(default is 0 = random free port)</port>
    <host>(default is "localhost")</host>
    <userAgent>(optionally overwrite browser user agent)</userAgent>
    <maxBufferSize>
      (Maximum byte size before a request/response content is considered
       too large. Can be specified using notations, e.g., 25MB.
       Zero or less means unlimited. Default is 10MB)
    </maxBufferSize>
    <responseTimeout>
      (How long to wait for the HTTP response from the target host to be
       processed.)
    </responseTimeout>
    <!-- Optional HTTP request headers passed on every HTTP requests -->
    <headers>
      <!-- You can repeat this header tag as needed. -->
      <header
          name="(header name)">
        (header value)
      </header>
    </headers>
    <!-- Optional chained proxy -->
    <chainedProxy/>
  </httpSniffer>
</fetcher>

XML usage example:


<fetcher
    class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
  <browser>firefox</browser>
  <driverPath>/drivers/geckodriver.exe</driverPath>
  <referenceFilters>
    <filter
        class="ReferenceFilter">
      <valueMatcher
          method="regex">
        .*dynamic.*$
      </valueMatcher>
    </filter>
  </referenceFilters>
</fetcher>

The above example uses Firefox, with the specified driver executable, to crawl dynamically generated pages.
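Another sketch combining elements documented above: waiting for JavaScript-rendered content to appear before grabbing the page, and saving a screenshot of each page to disk. The CSS selector and target directory are illustrative values, not defaults:

```xml
<fetcher
    class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
  <browser>chrome</browser>
  <!-- Wait up to 10 seconds for the dynamically rendered element
       (identified here by a hypothetical CSS selector) to show up. -->
  <waitForElement type="cssSelector" selector="#main-content">
    10 seconds
  </waitForElement>
  <!-- Save a PNG screenshot of each page under a date/time folder. -->
  <screenshot>
    <targets>directory</targets>
    <imageFormat>png</imageFormat>
    <targetDir structure="datetime">/tmp/screenshots</targetDir>
  </screenshot>
</fetcher>
```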

Since:
3.0.0
Author:
Pascal Essiembre
  • Constructor Details

    • WebDriverHttpFetcher

      public WebDriverHttpFetcher()
      Creates a new WebDriver HTTP Fetcher defaulting to Firefox.
    • WebDriverHttpFetcher

      public WebDriverHttpFetcher(WebDriverHttpFetcherConfig config)
      Creates a new WebDriver HTTP Fetcher for the supplied configuration.
      Parameters:
      config - WebDriver configuration
  • Method Details

    • getConfig

      public WebDriverHttpFetcherConfig getConfig()
    • accept

      protected boolean accept(HttpMethod httpMethod)
      Description copied from class: AbstractHttpFetcher
      Whether the supplied HttpMethod is supported by this fetcher.
      Specified by:
      accept in class AbstractHttpFetcher
      Parameters:
      httpMethod - the HTTP method
      Returns:
      true if supported
    • getUserAgent

      public String getUserAgent()
    • getScreenshotHandler

      public ScreenshotHandler getScreenshotHandler()
    • setScreenshotHandler

      public void setScreenshotHandler(ScreenshotHandler screenshotHandler)
    • fetcherStartup

      protected void fetcherStartup(HttpCollector c)
      Description copied from class: AbstractHttpFetcher
      Invoked once per fetcher instance, when the collector starts. Default implementation does nothing.
      Overrides:
      fetcherStartup in class AbstractHttpFetcher
      Parameters:
      c - collector
    • fetch

      public IHttpFetchResponse fetch(CrawlDoc doc, HttpMethod httpMethod) throws HttpFetchException
      Description copied from interface: IHttpFetcher

      Performs an HTTP request for the supplied document reference and HTTP method.

      For each HTTP method supported, implementors should do their best to populate the document and its CrawlDocInfo with as much information as they can.

      Unsupported HTTP methods should return an HTTP response with the CrawlState.UNSUPPORTED state. To avoid users having to configure multiple HTTP clients, implementors should try to support both the GET and HEAD methods. POST is only used in special cases and is often not used during a crawl session.

      A null method is treated as a GET.

      Parameters:
      doc - document to fetch or to use to make the request.
      httpMethod - HTTP method
      Returns:
      an HTTP response
      Throws:
      HttpFetchException - problem when fetching the document
    • fetcherThreadEnd

      protected void fetcherThreadEnd(HttpCrawler crawler)
      Description copied from class: AbstractHttpFetcher
      Invoked each time a crawler ends an existing crawler thread if that thread is the current thread. Default implementation does nothing.
      Overrides:
      fetcherThreadEnd in class AbstractHttpFetcher
      Parameters:
      crawler - crawler
    • fetcherShutdown

      protected void fetcherShutdown(HttpCollector c)
      Description copied from class: AbstractHttpFetcher
      Invoked once per fetcher when the collector ends. Default implementation does nothing.
      Overrides:
      fetcherShutdown in class AbstractHttpFetcher
      Parameters:
      c - collector
    • shutdownWebDriver

      protected void shutdownWebDriver()
    • getWebDriver

      protected org.openqa.selenium.WebDriver getWebDriver()
      Gets the web driver associated with the current thread, or creates one if none is found. Prior to 3.1.0, this method could return null.
      Returns:
      web driver (never null)
    • fetchDocumentContent

      protected InputStream fetchDocumentContent(String url)
    • loadHttpFetcherFromXML

      public void loadHttpFetcherFromXML(XML xml)
      Specified by:
      loadHttpFetcherFromXML in class AbstractHttpFetcher
    • saveHttpFetcherToXML

      public void saveHttpFetcherToXML(XML xml)
      Specified by:
      saveHttpFetcherToXML in class AbstractHttpFetcher
    • equals

      public boolean equals(Object other)
      Overrides:
      equals in class AbstractHttpFetcher
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class AbstractHttpFetcher
    • toString

      public String toString()
      Overrides:
      toString in class AbstractHttpFetcher