Class WebDriverHttpFetcher
- java.lang.Object
  - com.norconex.collector.http.fetch.AbstractHttpFetcher
    - com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher
- All Implemented Interfaces:
  IHttpFetcher, IEventListener<Event>, IXMLConfigurable, EventListener, Consumer<Event>
public class WebDriverHttpFetcher extends AbstractHttpFetcher
Uses Selenium WebDriver to drive native browsers when crawling documents. Useful for crawling JavaScript-driven websites.
Considerations
Relying on external software to fetch pages can be slower, less scalable, and less stable. Downloading binaries and non-HTML file formats may not always be possible. The use of GenericHttpFetcher should be preferred whenever possible. Use at your own risk.
Supported HTTP method
This fetcher only supports the HTTP GET method.
HTTP Headers
By default, web drivers do not expose HTTP headers from HTTP GET requests. If you want to capture them, configure the "httpSniffer". A proxy service will be started to monitor HTTP traffic and store HTTP headers.
NOTE: Capturing headers with a proxy may not be supported by all Browsers/WebDriver implementations.
XML configuration usage:
<fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">

  <browser>[chrome|edge|firefox|opera|safari]</browser>

  <!-- Local web driver settings -->
  <browserPath>(browser executable or blank to detect)</browserPath>
  <driverPath>(driver executable or blank to detect)</driverPath>

  <!-- Remote web driver setting -->
  <remoteURL>(URL of the remote web driver cluster)</remoteURL>

  <!-- Optional browser capabilities supported by the web driver. -->
  <capabilities>
    <capability name="(capability name)">(capability value)</capability>
    <!-- multiple "capability" tags allowed -->
  </capabilities>

  <!-- Optional browser arguments for web drivers supporting them. -->
  <arguments>
    <arg>(argument value)</arg>
    <!-- multiple "arg" tags allowed -->
  </arguments>

  <!-- Optionally take screenshots of each web page. -->
  <screenshot>
    <cssSelector>(Optional selector of element to capture.)</cssSelector>
    <targets>[metadata|directory] (One or both, separated by comma.)</targets>
    <imageFormat>(Image format. Default is "png".)</imageFormat>
    <!-- The following applies to the "directory" target: -->
    <targetDir field="(Document field to store the local path to the image.)"
        structure="[url2path|date|datetime]">
      (Local directory where to save images.)
    </targetDir>
    <!-- The following applies to the "metadata" target: -->
    <targetMetaField>(Document field where to store the image.)</targetMetaField>
  </screenshot>

  <windowSize>(Optional. Browser window dimensions. E.g., 640x480)</windowSize>

  <earlyPageScript>
    (Optional JavaScript code to be run the moment a page is requested.)
  </earlyPageScript>
  <latePageScript>
    (Optional JavaScript code to be run after we are done waiting for a page.)
  </latePageScript>

  <!-- The following timeouts/waits are set in milliseconds or
       human-readable format (English). Default is zero (not set). -->
  <pageLoadTimeout>(Web driver max wait time for a page to load.)</pageLoadTimeout>
  <implicitlyWait>(Web driver max wait time for an element to appear. See "waitForElement".)</implicitlyWait>
  <scriptTimeout>(Web driver max wait time for a script to execute.)</scriptTimeout>
  <waitForElement
      type="[tagName|className|cssSelector|id|linkText|name|partialLinkText|xpath]"
      selector="(Reference to element, as per the type specified.)">
    (Max wait time for an element to show up in browser before returning.
     Default 'type' is 'tagName'.)
  </waitForElement>
  <threadWait>
    (Makes the current thread sleep for the specified duration, to give the
     web driver enough time to load the page. Sometimes necessary for some
     web driver implementations if the above options do not work.)
  </threadWait>

  <referenceFilters>
    <!-- multiple "filter" tags allowed -->
    <filter class="(any reference filter class)">
      (Restrict usage of this fetcher to matching reference filters.
       Refer to the documentation for the IReferenceFilter implementation
       you are using here for usage details.)
    </filter>
  </referenceFilters>

  <!-- Optionally set up an HTTP proxy that allows setting and capturing
       HTTP headers. For advanced use only. Not recommended for regular usage. -->
  <httpSniffer>
    <port>(default is 0 = random free port)</port>
    <host>(default is "localhost")</host>
    <userAgent>(optionally overwrite browser user agent)</userAgent>
    <maxBufferSize>
      (Maximum byte size before a request/response content is considered
       too large. Can be specified using notations, e.g., 25MB. Default is 10MB.)
    </maxBufferSize>
    <!-- Optional HTTP request headers passed on every HTTP request -->
    <headers>
      <!-- You can repeat this header tag as needed. -->
      <header name="(header name)">(header value)</header>
    </headers>
    <!-- Optional chained proxy -->
    <chainedProxy/>
  </httpSniffer>
</fetcher>
XML usage example:
<fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
  <browser>firefox</browser>
  <driverPath>/drivers/geckodriver.exe</driverPath>
  <referenceFilters>
    <filter class="ReferenceFilter">
      <valueMatcher method="regex">.*dynamic.*$</valueMatcher>
    </filter>
  </referenceFilters>
</fetcher>
The above example uses Firefox, with a specific web driver executable, to crawl dynamically generated pages matching the reference filter.
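The same setup can be done programmatically. Below is a minimal sketch; the Browser enum value and the setBrowser/setDriverPath setters on WebDriverHttpFetcherConfig are assumptions based on the XML tags above, so verify them against your version:

import java.nio.file.Paths;

import com.norconex.collector.http.fetch.impl.webdriver.Browser;
import com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher;
import com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcherConfig;

public class WebDriverFetcherSetup {
    public static void main(String[] args) {
        // Mirrors the <browser> and <driverPath> tags from the XML example above.
        WebDriverHttpFetcherConfig cfg = new WebDriverHttpFetcherConfig();
        cfg.setBrowser(Browser.FIREFOX);                          // assumed setter name
        cfg.setDriverPath(Paths.get("/drivers/geckodriver.exe")); // assumed setter name

        // Documented constructor taking a configuration.
        WebDriverHttpFetcher fetcher = new WebDriverHttpFetcher(cfg);

        // The fetcher would then be assigned to the crawler configuration,
        // just as the XML <fetcher> element is.
        System.out.println(fetcher.getConfig());
    }
}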
- Since:
- 3.0.0
- Author:
- Pascal Essiembre
-
Constructor Summary
WebDriverHttpFetcher()
    Creates a new WebDriver HTTP Fetcher defaulting to Firefox.
WebDriverHttpFetcher(WebDriverHttpFetcherConfig config)
    Creates a new WebDriver HTTP Fetcher for the supplied configuration.
-
Method Summary
protected boolean accept(HttpMethod httpMethod)
    Whether the supplied HttpMethod is supported by this fetcher.
boolean equals(Object other)
IHttpFetchResponse fetch(CrawlDoc doc, HttpMethod httpMethod)
    Performs an HTTP request for the supplied document reference and HTTP method.
protected InputStream fetchDocumentContent(String url)
protected void fetcherShutdown(HttpCollector c)
    Invoked once per fetcher when the collector ends.
protected void fetcherStartup(HttpCollector c)
    Invoked once per fetcher instance, when the collector starts.
protected void fetcherThreadBegin(HttpCrawler crawler)
    Invoked each time a crawler begins a new crawler thread if that thread is the current thread.
protected void fetcherThreadEnd(HttpCrawler crawler)
    Invoked each time a crawler ends an existing crawler thread if that thread is the current thread.
WebDriverHttpFetcherConfig getConfig()
ScreenshotHandler getScreenshotHandler()
String getUserAgent()
protected org.openqa.selenium.WebDriver getWebDriver()
    Gets the web driver associated with the current thread (if any).
int hashCode()
void loadHttpFetcherFromXML(XML xml)
void saveHttpFetcherToXML(XML xml)
void setScreenshotHandler(ScreenshotHandler screenshotHandler)
String toString()
-
Methods inherited from class com.norconex.collector.http.fetch.AbstractHttpFetcher
accept, accept, getReferenceFilters, loadFromXML, saveToXML, setReferenceFilters, setReferenceFilters
-
Constructor Detail
-
WebDriverHttpFetcher
public WebDriverHttpFetcher()
Creates a new WebDriver HTTP Fetcher defaulting to Firefox.
-
WebDriverHttpFetcher
public WebDriverHttpFetcher(WebDriverHttpFetcherConfig config)
Creates a new WebDriver HTTP Fetcher for the supplied configuration.
- Parameters:
  config - WebDriver configuration
-
-
Method Detail
-
getConfig
public WebDriverHttpFetcherConfig getConfig()
-
accept
protected boolean accept(HttpMethod httpMethod)
Description copied from class: AbstractHttpFetcher
Whether the supplied HttpMethod is supported by this fetcher.
- Specified by:
  accept in class AbstractHttpFetcher
- Parameters:
  httpMethod - the HTTP method
- Returns:
  true if supported
-
getUserAgent
public String getUserAgent()
-
getScreenshotHandler
public ScreenshotHandler getScreenshotHandler()
-
setScreenshotHandler
public void setScreenshotHandler(ScreenshotHandler screenshotHandler)
-
fetcherStartup
protected void fetcherStartup(HttpCollector c)
Description copied from class: AbstractHttpFetcher
Invoked once per fetcher instance, when the collector starts. Default implementation does nothing.
- Overrides:
  fetcherStartup in class AbstractHttpFetcher
- Parameters:
  c - collector
-
fetcherThreadBegin
protected void fetcherThreadBegin(HttpCrawler crawler)
Description copied from class: AbstractHttpFetcher
Invoked each time a crawler begins a new crawler thread if that thread is the current thread. Default implementation does nothing.
- Overrides:
  fetcherThreadBegin in class AbstractHttpFetcher
- Parameters:
  crawler - crawler
-
fetch
public IHttpFetchResponse fetch(CrawlDoc doc, HttpMethod httpMethod) throws HttpFetchException
Description copied from interface: IHttpFetcher
Performs an HTTP request for the supplied document reference and HTTP method.
For each HTTP method supported, implementors should do their best to populate the document and its CrawlDocInfo with as much information as they can. Unsupported HTTP methods should return an HTTP response with the CrawlState.UNSUPPORTED state. To prevent users from having to configure multiple HTTP clients, implementors should try to support both the GET and HEAD methods. POST is only used in special cases and is often not used during a crawl session. A null method is treated as a GET.
- Parameters:
  doc - document to fetch or to use to make the request
  httpMethod - HTTP method
- Returns:
  an HTTP response
- Throws:
  HttpFetchException - problem when fetching the document
- See Also:
  HttpFetchResponseBuilder.unsupported()
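The contract above can be illustrated with a thin subclass that normalizes the HTTP method before delegating to this fetcher. This is only a sketch of the described behavior, not the library's own code; the import package locations and the exact return type of HttpFetchResponseBuilder.unsupported() are assumptions:

import com.norconex.collector.core.doc.CrawlDoc;                   // assumed package
import com.norconex.collector.http.fetch.HttpFetchException;       // assumed package
import com.norconex.collector.http.fetch.HttpFetchResponseBuilder; // assumed package
import com.norconex.collector.http.fetch.HttpMethod;               // assumed package
import com.norconex.collector.http.fetch.IHttpFetchResponse;       // assumed package
import com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher;

public class GetOnlyWebDriverFetcher extends WebDriverHttpFetcher {
    @Override
    public IHttpFetchResponse fetch(CrawlDoc doc, HttpMethod httpMethod)
            throws HttpFetchException {
        // A null method is treated as a GET, as per the interface contract.
        HttpMethod method = httpMethod == null ? HttpMethod.GET : httpMethod;
        if (method != HttpMethod.GET) {
            // Only GET is supported; signal CrawlState.UNSUPPORTED using the
            // documented helper (assumed to return a ready-made response).
            return HttpFetchResponseBuilder.unsupported();
        }
        return super.fetch(doc, method);
    }
}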
-
fetcherThreadEnd
protected void fetcherThreadEnd(HttpCrawler crawler)
Description copied from class: AbstractHttpFetcher
Invoked each time a crawler ends an existing crawler thread if that thread is the current thread. Default implementation does nothing.
- Overrides:
  fetcherThreadEnd in class AbstractHttpFetcher
- Parameters:
  crawler - crawler
-
fetcherShutdown
protected void fetcherShutdown(HttpCollector c)
Description copied from class: AbstractHttpFetcher
Invoked once per fetcher when the collector ends. Default implementation does nothing.
- Overrides:
  fetcherShutdown in class AbstractHttpFetcher
- Parameters:
  c - collector
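The lifecycle callbacks above give subclasses hooks into collector startup/shutdown and per-thread begin/end. A minimal logging sketch, assuming the parent implementations (which manage the per-thread browsers) should be invoked as shown:

import com.norconex.collector.http.HttpCollector;
import com.norconex.collector.http.crawler.HttpCrawler;
import com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher;

public class LoggingWebDriverHttpFetcher extends WebDriverHttpFetcher {
    @Override
    protected void fetcherStartup(HttpCollector c) {
        super.fetcherStartup(c);
        System.out.println("WebDriver fetcher started.");
    }
    @Override
    protected void fetcherThreadBegin(HttpCrawler crawler) {
        super.fetcherThreadBegin(crawler); // parent sets up the thread's browser (assumption)
        System.out.println("Crawler thread ready: " + Thread.currentThread().getName());
    }
    @Override
    protected void fetcherThreadEnd(HttpCrawler crawler) {
        System.out.println("Crawler thread ending: " + Thread.currentThread().getName());
        super.fetcherThreadEnd(crawler); // parent releases the thread's browser (assumption)
    }
    @Override
    protected void fetcherShutdown(HttpCollector c) {
        super.fetcherShutdown(c);
        System.out.println("WebDriver fetcher shut down.");
    }
}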
-
getWebDriver
protected org.openqa.selenium.WebDriver getWebDriver()
Gets the web driver associated with the current thread (if any).
- Returns:
  web driver or null
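Because getWebDriver() exposes the Selenium driver bound to the current thread, a subclass can interact with the browser directly, for example to run extra JavaScript after the parent fetch completes. A sketch assuming a driver is bound while fetch(...) runs on a crawler thread; the import package locations outside the webdriver package are assumptions, and the <latePageScript> configuration option is the simpler, built-in alternative for this particular need:

import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;

import com.norconex.collector.core.doc.CrawlDoc;                // assumed package
import com.norconex.collector.http.fetch.HttpFetchException;    // assumed package
import com.norconex.collector.http.fetch.HttpMethod;            // assumed package
import com.norconex.collector.http.fetch.IHttpFetchResponse;    // assumed package
import com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher;

public class ScrollingWebDriverHttpFetcher extends WebDriverHttpFetcher {
    @Override
    public IHttpFetchResponse fetch(CrawlDoc doc, HttpMethod httpMethod)
            throws HttpFetchException {
        IHttpFetchResponse response = super.fetch(doc, httpMethod);

        // Documented accessor: the driver bound to the current thread, or null.
        WebDriver driver = getWebDriver();
        if (driver instanceof JavascriptExecutor) {
            // Standard Selenium call: scroll to the bottom of the loaded page.
            ((JavascriptExecutor) driver).executeScript(
                    "window.scrollTo(0, document.body.scrollHeight);");
        }
        return response;
    }
}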
-
fetchDocumentContent
protected InputStream fetchDocumentContent(String url)
-
loadHttpFetcherFromXML
public void loadHttpFetcherFromXML(XML xml)
- Specified by:
  loadHttpFetcherFromXML in class AbstractHttpFetcher
-
saveHttpFetcherToXML
public void saveHttpFetcherToXML(XML xml)
- Specified by:
  saveHttpFetcherToXML in class AbstractHttpFetcher
-
equals
public boolean equals(Object other)
- Overrides:
  equals in class AbstractHttpFetcher
-
hashCode
public int hashCode()
- Overrides:
  hashCode in class AbstractHttpFetcher
-
toString
public String toString()
- Overrides:
  toString in class AbstractHttpFetcher