public class WebDriverHttpFetcher extends AbstractHttpFetcher
Uses Selenium WebDriver to drive native browsers when crawling documents. Useful for crawling JavaScript-driven websites.
Relying on external software to fetch pages can be slower, less
scalable, and less stable. Downloading binaries and non-HTML file
formats may not always be possible. The use of GenericHttpFetcher
should be preferred whenever possible. Use at your own risk.
This fetcher only supports HTTP GET method.
By default, web drivers do not expose HTTP headers from HTTP GET requests. If you want to capture them, configure the "httpSniffer". A proxy service will be started to monitor HTTP traffic and store HTTP headers.
NOTE: Capturing headers with a proxy may not be supported by all Browsers/WebDriver implementations.
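For instance, a minimal sniffer setup might look like the following sketch. The element names are from the configuration reference; the port and user agent values are placeholders:

```xml
<!-- Hypothetical minimal setup: start the sniffer proxy on a random
     free port and override the browser user agent. -->
<httpSniffer>
  <port>0</port>
  <userAgent>MyCrawler/1.0</userAgent>
</httpSniffer>
```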
<fetcher
class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
<browser>[chrome|edge|firefox|opera|safari]</browser>
<!-- Local web driver settings -->
<browserPath>(browser executable or blank to detect)</browserPath>
<driverPath>(driver executable or blank to detect)</driverPath>
<!-- Remote web driver setting -->
<remoteURL>(URL of the remote web driver cluster)</remoteURL>
<!-- Optional browser capabilities supported by the web driver. -->
<capabilities>
<capability
name="(capability name)">
(capability value)
</capability>
<!-- multiple "capability" tags allowed -->
</capabilities>
<!-- Optionally take screenshots of each web page. -->
<screenshot>
<cssSelector>(Optional selector of element to capture.)</cssSelector>
<targets>[metadata|directory] (One or both, separated by comma.)</targets>
<imageFormat>(Image format. Default is "png".)</imageFormat>
<!-- The following applies to the "directory" target: -->
<targetDir
field="(Document field to store the local path to the image.)"
structure="[url2path|date|datetime]">
(Local directory where to save images.)
</targetDir>
<!-- The following applies to the "metadata" target: -->
<targetMetaField>
(Document field where to store the image.)
</targetMetaField>
</screenshot>
<windowSize>(Optional. Browser window dimensions. E.g., 640x480)</windowSize>
<earlyPageScript>
(Optional JavaScript code to be run the moment a page is requested.)
</earlyPageScript>
<latePageScript>
(Optional JavaScript code to be run after we are done
waiting for a page.)
</latePageScript>
<!--
The following timeouts/waits are set in milliseconds or
human-readable format (English). Default is zero (not set).
-->
<pageLoadTimeout>
(Web driver max wait time for a page to load.)
</pageLoadTimeout>
<implicitlyWait>
(Web driver max wait time for an element to appear. See
"waitForElement".)
</implicitlyWait>
<scriptTimeout>
(Web driver max wait time for scripts to execute.)
</scriptTimeout>
<waitForElement
type="[tagName|className|cssSelector|id|linkText|name|partialLinkText|xpath]"
selector="(Reference to element, as per the type specified.)">
(Max wait time for an element to show up in browser before returning.
Default 'type' is 'tagName'.)
</waitForElement>
<threadWait>
(Makes the current thread sleep for the specified duration, to
give the web driver enough time to load the page.
Sometimes necessary for some web driver implementations if the above
options do not work.)
</threadWait>
<referenceFilters>
<!-- multiple "filter" tags allowed -->
<filter
class="(any reference filter class)">
(Restrict usage of this fetcher to matching reference filters.
Refer to the documentation for the IReferenceFilter implementation
you are using here for usage details.)
</filter>
</referenceFilters>
<!--
Optionally set up an HTTP proxy that allows setting and capturing
HTTP headers. For advanced use only. Not recommended
for regular usage.
-->
<httpSniffer>
<port>(default is 0 = random free port)</port>
<userAgent>(optionally overwrite browser user agent)</userAgent>
<maxBufferSize>
(Maximum byte size before a request/response content is considered
too large. Can be specified using notations, e.g., 25MB. Default is 10MB)
</maxBufferSize>
<!-- Optional HTTP request headers passed on every HTTP request -->
<headers>
<!-- You can repeat this header tag as needed. -->
<header
name="(header name)">
(header value)
</header>
</headers>
</httpSniffer>
</fetcher>
<fetcher
class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
<browser>firefox</browser>
<driverPath>/drivers/geckodriver.exe</driverPath>
<referenceFilters>
<filter
class="ReferenceFilter">
<valueMatcher
method="regex">
.*dynamic.*$
</valueMatcher>
</filter>
</referenceFilters>
</fetcher>
The above example uses Firefox, with an explicit driver executable, to crawl dynamically generated pages whose URLs match the "dynamic" pattern.
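As another illustrative sketch, the configuration below combines the waiting and screenshot options from the reference above. The CSS selector, wait duration, window size, and target directory are made-up values:

```xml
<fetcher
    class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
  <browser>chrome</browser>
  <windowSize>1280x800</windowSize>
  <!-- Wait up to 30 seconds for a (hypothetical) main content
       element to be rendered by JavaScript. -->
  <waitForElement type="cssSelector" selector="div.article-body">
    30 seconds
  </waitForElement>
  <!-- Save a PNG screenshot of each page under a local directory,
       using a path derived from the page URL. -->
  <screenshot>
    <targets>directory</targets>
    <imageFormat>png</imageFormat>
    <targetDir structure="url2path">/tmp/crawl-screenshots</targetDir>
  </screenshot>
</fetcher>
```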
Constructor | Description
---|---
WebDriverHttpFetcher() |
WebDriverHttpFetcher(WebDriverHttpFetcherConfig config) |
Modifier and Type | Method and Description
---|---
protected boolean | accept(HttpMethod httpMethod) Whether the supplied HttpMethod is supported by this fetcher.
boolean | equals(Object other)
IHttpFetchResponse | fetch(CrawlDoc doc, HttpMethod httpMethod) Performs an HTTP request for the supplied document reference and HTTP method.
protected InputStream | fetchDocumentContent(String url)
protected void | fetcherShutdown(HttpCollector c) Invoked once per fetcher when the collector ends.
protected void | fetcherStartup(HttpCollector c) Invoked once per fetcher instance, when the collector starts.
protected void | fetcherThreadBegin(HttpCrawler crawler) Invoked each time a crawler begins a new crawler thread if that thread is the current thread.
protected void | fetcherThreadEnd(HttpCrawler crawler) Invoked each time a crawler ends an existing crawler thread if that thread is the current thread.
WebDriverHttpFetcherConfig | getConfig()
ScreenshotHandler | getScreenshotHandler()
String | getUserAgent()
protected org.openqa.selenium.WebDriver | getWebDriver()
int | hashCode()
void | loadHttpFetcherFromXML(XML xml)
void | saveHttpFetcherToXML(XML xml)
void | setScreenshotHandler(ScreenshotHandler screenshotHandler)
String | toString()
Methods inherited from class AbstractHttpFetcher: accept, accept, getReferenceFilters, loadFromXML, saveToXML, setReferenceFilters, setReferenceFilters
public WebDriverHttpFetcher()
public WebDriverHttpFetcher(WebDriverHttpFetcherConfig config)
public WebDriverHttpFetcherConfig getConfig()
protected boolean accept(HttpMethod httpMethod)
Whether the supplied HttpMethod is supported by this fetcher.
Specified by: accept in class AbstractHttpFetcher
Parameters: httpMethod - the HTTP method
Returns: true if supported

public String getUserAgent()
public ScreenshotHandler getScreenshotHandler()
public void setScreenshotHandler(ScreenshotHandler screenshotHandler)
protected void fetcherStartup(HttpCollector c)
Invoked once per fetcher instance, when the collector starts.
Overrides: fetcherStartup in class AbstractHttpFetcher
Parameters: c - collector

protected void fetcherThreadBegin(HttpCrawler crawler)
Invoked each time a crawler begins a new crawler thread if that thread is the current thread.
Overrides: fetcherThreadBegin in class AbstractHttpFetcher
Parameters: crawler - crawler

protected void fetcherThreadEnd(HttpCrawler crawler)
Invoked each time a crawler ends an existing crawler thread if that thread is the current thread.
Overrides: fetcherThreadEnd in class AbstractHttpFetcher
Parameters: crawler - crawler

protected void fetcherShutdown(HttpCollector c)
Invoked once per fetcher when the collector ends.
Overrides: fetcherShutdown in class AbstractHttpFetcher
Parameters: c - collector

public IHttpFetchResponse fetch(CrawlDoc doc, HttpMethod httpMethod) throws HttpFetchException
Description copied from interface: IHttpFetcher
Performs an HTTP request for the supplied document reference and HTTP method.
For each supported HTTP method, implementors should do their best to
populate the document and its CrawlDocInfo with as much information as
they can. Unsupported HTTP methods should return an HTTP response with
the CrawlState.UNSUPPORTED state. To prevent users from having to
configure multiple HTTP clients, implementors should try to support
both the GET and HEAD methods. POST is only used in special cases and
is often not used during a crawl session. A null method is treated as
a GET.
Parameters: doc - document to fetch or to use to make the request.
httpMethod - HTTP method
Throws: HttpFetchException - problem when fetching the document
See Also: HttpFetchResponseBuilder.unsupported()
protected org.openqa.selenium.WebDriver getWebDriver()
protected InputStream fetchDocumentContent(String url)
public void loadHttpFetcherFromXML(XML xml)
Specified by: loadHttpFetcherFromXML in class AbstractHttpFetcher

public void saveHttpFetcherToXML(XML xml)
Specified by: saveHttpFetcherToXML in class AbstractHttpFetcher

public boolean equals(Object other)
Overrides: equals in class AbstractHttpFetcher

public int hashCode()
Overrides: hashCode in class AbstractHttpFetcher

public String toString()
Overrides: toString in class AbstractHttpFetcher
Copyright © 2009–2023 Norconex Inc. All rights reserved.