WebDriverHttpFetcher
@Deprecated public class PhantomJSDocumentFetcher extends AbstractHttpFetcher
PhantomJS headless browser is no longer maintained by its owner.
As such, starting with version 3.0.0, use of PhantomJSDocumentFetcher is
strongly discouraged and HttpClientProxy support for it has been dropped.
Now that more popular browsers (e.g., Chrome) support running
in headless mode, more stable options are available. Please consider
using WebDriverHttpFetcher
instead when crawling
a JavaScript-driven website.
An alternative to the GenericHttpFetcher
which relies on an
external PhantomJS installation
to fetch web pages. While less efficient, this implementation is meant
to provide some way to crawl sites making heavy use of JavaScript to render
their pages. This class tells the PhantomJS headless browser to wait a
certain amount of time for the page to load extra content via Ajax requests
before grabbing all loaded HTML.
Relying on external software to fetch pages is slower, less
scalable, and may be less stable. The use of GenericHttpFetcher
should be preferred whenever possible. Use at your own risk.
Use PhantomJS 2.1 (or possibly higher).
It is usually only useful to use PhantomJS for HTML pages with JavaScript.
Other types of documents are fetched using an instance of
GenericHttpFetcher.
To find out whether it is dealing with an HTML
document, this fetcher needs to know the content type first.
By default, the content type
of a document is not known before a physical copy is obtained.
This means PhantomJS first has to download the document and, if it turns out
not to be an HTML document, it is downloaded a second time with the generic
document fetcher.
By default, these content-types are considered HTML:
text/html, application/xhtml+xml, application/vnd.wap.xhtml+xml, application/x-asp
Those can be overridden with setContentTypePattern(String).
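The content-type check described above can be sketched as follows. This is a minimal illustration only: the class name, the method name, and the regular expression are assumptions written for this example, and the actual value of DEFAULT_CONTENT_TYPE_PATTERN may differ.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class ContentTypeCheck {

    // Hypothetical pattern covering the four default types listed above.
    // It is NOT copied from DEFAULT_CONTENT_TYPE_PATTERN.
    static final Pattern HTML_TYPES = Pattern.compile(
            "text/html|application/xhtml\\+xml"
            + "|application/vnd\\.wap\\.xhtml\\+xml|application/x-asp");

    // Returns true when the media type (parameters such as
    // "; charset=UTF-8" are ignored) is one of the HTML types.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        String mediaType = contentType.split(";", 2)[0]
                .trim().toLowerCase(Locale.ROOT);
        return HTML_TYPES.matcher(mediaType).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8"));
        System.out.println(isHtml("application/pdf"));
    }
}
```

A custom expression passed to setContentTypePattern(String) would take the place of the hypothetical HTML_TYPES pattern in this sketch.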
To avoid downloading the document twice as described above, you can
configure a metadata fetcher (such as GenericHttpFetcher). This
will attempt to get the content type by first making an HTTP HEAD request.
Alternatively, if you have a URL pattern that identifies your HTML pages
(and only HTML pages), you can specify it using
setReferencePattern(String). Only URLs matching the provided
regular expression will be fetched by PhantomJS. By default there is no
pattern for discriminating on URL references.
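A rough sketch of how a reference pattern can route URLs is shown below. The class and method names are hypothetical, and the use of a full match (rather than a partial "find") is an assumption, not something this page confirms.

```java
import java.util.regex.Pattern;

class ReferenceRouting {

    // Returns true when the URL is a candidate for the headless browser.
    // A null pattern mirrors the documented default: no discrimination
    // on URL references. Full-match semantics are assumed here.
    static boolean isBrowserCandidate(String url, String referencePattern) {
        if (referencePattern == null) {
            return true;
        }
        return Pattern.matches(referencePattern, url);
    }

    public static void main(String[] args) {
        String pattern = "^.*\\.html$";
        System.out.println(
                isBrowserCandidate("https://example.com/a.html", pattern));
        System.out.println(
                isBrowserCandidate("https://example.com/a.pdf", pattern));
    }
}
```

URLs that are not candidates would be left to the generic fetcher.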
Thanks to PhantomJS, one can save images of pages being crawled, including those rendered with JavaScript!
Since 2.8.0, you have to explicitly enable screenshots with
setScreenshotEnabled(boolean). Also, screenshots now share the same
size by default.
In addition, you can now control how screenshots are resized and how
they are stored.
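Storage options:
inline: the screenshot is stored as a Base64 string in the collector.featured-image-inline
field.
The string is ready to be
used inline, in an <img src="..."> tag.
disk: the screenshot is saved to disk and its path is stored in the collector.featured-image-path
field.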
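To illustrate what "ready to be used inline" can look like, here is a small sketch that builds an image tag from raw image bytes using a Base64 data URI. The class and method names are hypothetical. Whether the stored field value already includes the "data:" prefix is not stated on this page, so the sketch builds the whole value itself.

```java
import java.util.Base64;

class InlineImageTag {

    // Encodes the image bytes as Base64 and wraps them in a data URI.
    // The "subtype" argument is the MIME subtype (e.g. "png").
    static String toImgTag(byte[] imageBytes, String subtype) {
        String base64 = Base64.getEncoder().encodeToString(imageBytes);
        return "<img src=\"data:image/" + subtype
                + ";base64," + base64 + "\">";
    }

    public static void main(String[] args) {
        // Three placeholder bytes stand in for real image content.
        System.out.println(toImgTag(new byte[] {1, 2, 3}, "png"));
    }
}
```

Note that the MIME subtype is not always the same as the configured image format: a "jpg" screenshot would use "jpeg".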
Since 2.8.0, it is possible to specify a resource timeout so that slow individual page resources do not cause PhantomJS to hang for a long time.
Since 2.9.1, it is possible to specify which PhantomJS exit values
are to be considered "valid". Provide a comma-separated list of integers in
the XML configuration, or use the setValidExitCodes(int...)
method. By default, only zero is
considered valid.
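The exit-code rule can be sketched as below. The class and method names are made up for this example, and the parsing shown is only an assumption about how a comma-separated list could be read. It is not taken from the fetcher's source.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

class ExitCodes {

    // Parses a comma-separated list such as "0, 1,2" into integers.
    static List<Integer> parse(String csv) {
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(Integer::valueOf)
                .collect(Collectors.toList());
    }

    // When no codes are configured, only zero is considered valid.
    static boolean isValid(int exitCode, List<Integer> configured) {
        List<Integer> effective = configured.isEmpty()
                ? Collections.singletonList(0) : configured;
        return effective.contains(exitCode);
    }

    public static void main(String[] args) {
        System.out.println(isValid(1, parse("0,1")));
        System.out.println(isValid(1, parse("")));
    }
}
```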
XML configuration entries expecting millisecond durations
can be provided in human-readable format (English only), as per
DurationParser
(e.g., "5 minutes and 30 seconds" or "5m30s").
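To make the millisecond equivalence concrete, the toy parser below converts the compact form (such as "5m30s") into milliseconds. It is not DurationParser and does not reproduce its behavior: the class name, the supported units (h, m, s, ms), and the rejection of the longer English form are all assumptions of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompactDuration {

    // "ms" is listed before "m" so that "3000ms" is not read as minutes.
    private static final Pattern PART =
            Pattern.compile("\\s*(\\d+)\\s*(ms|h|m|s)");

    static long toMillis(String text) {
        String s = text.trim();
        if (s.matches("\\d+")) {
            return Long.parseLong(s); // plain number: already milliseconds
        }
        Matcher m = PART.matcher(s);
        long total = 0;
        int end = 0;
        while (m.find()) {
            if (m.start() != end) {
                throw new IllegalArgumentException("Unsupported: " + text);
            }
            long n = Long.parseLong(m.group(1));
            switch (m.group(2)) {
                case "h": total += n * 3_600_000L; break;
                case "m": total += n * 60_000L; break;
                case "s": total += n * 1_000L; break;
                default:  total += n; // "ms"
            }
            end = m.end();
        }
        if (end == 0 || end != s.length()) {
            throw new IllegalArgumentException("Unsupported: " + text);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(toMillis("5m30s"));
    }
}
```

Under these assumptions, "5m30s" and the plain value 330000 describe the same duration.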
<documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher"
    detectContentType="[false|true]"
    detectCharset="[false|true]"
    screenshotEnabled="[false|true]">
  <exePath>(path to PhantomJS executable)</exePath>
  <scriptPath>
    (Optional path to a PhantomJS script. Defaults to scripts/phantom.js)
  </scriptPath>
  <renderWaitTime>
    (Milliseconds to wait for the entire page to load.
    Defaults to 3000, i.e., 3 seconds.)
  </renderWaitTime>
  <resourceTimeout>
    (Optional milliseconds to wait for a page resource to load.
    Default is unspecified.)
  </resourceTimeout>
  <options>
    <opt>(optional extra PhantomJS command-line option)</opt>
    <!-- You can have multiple opt tags -->
  </options>
  <referencePattern>
    (Regular expression matching URLs for which to use the
    PhantomJS browser. Non-matching URLs will fall back
    to using GenericDocumentFetcher.)
  </referencePattern>
  <contentTypePattern>
    (Regular expression matching content types for which to use
    the PhantomJS browser. Non-matching content types will use
    the GenericDocumentFetcher.)
  </contentTypePattern>
  <validExitCodes>(defaults to 0)</validExitCodes>
  <validStatusCodes>(defaults to 200)</validStatusCodes>
  <notFoundStatusCodes>(defaults to 404)</notFoundStatusCodes>
  <headersPrefix>(string to prefix headers)</headersPrefix>
  <!-- Only applicable when screenshotEnabled is true: -->
  <screenshotDimensions>
    (Pixel size of the browser page area to capture: [width]x[height].
    E.g., 800x600. Only used when a screenshot path is specified.
    Default is undefined. It will try to load all it can and may
    produce vertically long images.)
  </screenshotDimensions>
  <screenshotZoomFactor>
    (A decimal value to scale the screenshot image.
    E.g., 0.25 will make the image 25% its regular size,
    which is 25% of the above dimension if specified.
    Default is 1, i.e., 100%)
  </screenshotZoomFactor>
  <screenshotScaleDimensions>
    (Target pixel size the main image should be scaled to.
    Default is 300.)
  </screenshotScaleDimensions>
  <screenshotScaleStretch>
    [false|true]
    (Whether to stretch to match scale size. Default keeps aspect ratio.)
  </screenshotScaleStretch>
  <screenshotScaleQuality>
    [auto|low|medium|high|max]
    (Default is "auto", which tries the best balance between quality
    and speed based on image size. The lower the quality the faster
    it is to scale images.)
  </screenshotScaleQuality>
  <screenshotImageFormat>
    (Target format of stored image. E.g., "jpg", "png", "gif", "bmp", ...
    Default is "png")
  </screenshotImageFormat>
  <screenshotStorage>
    [disk|inline]
    (One or both, comma-separated. Default is "disk".)
  </screenshotStorage>
  <!-- Only applicable for "disk" storage: -->
  <screenshotStorageDiskDir structure="[url2path|date|datetime]">
    (Path where to save screenshots.)
  </screenshotStorageDiskDir>
  <screenshotStorageDiskField>
    (Overwrite default field where to store the screenshot path.)
  </screenshotStorageDiskField>
  <!-- Only applicable for "inline" storage: -->
  <screenshotStorageInlineField>
    (Overwrite default field where to store the inline screenshot.)
  </screenshotStorageInlineField>
</documentFetcher>
When specifying an image size, the format is [width]x[height]
or a single value. When a single value is used, that value represents both
the width and height (i.e., a square).
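The size format described above can be parsed along these lines. This is a sketch with a hypothetical class name. It shows the "[width]x[height] or single value" rule, not the library's own parsing code.

```java
import java.awt.Dimension;
import java.util.Locale;

class ImageSize {

    // Accepts "800x600" (width x height) or "300" (a 300x300 square).
    static Dimension parse(String text) {
        // The -1 limit keeps trailing empty parts, so "800x" is rejected
        // by Integer.parseInt instead of being read as a square.
        String[] parts = text.trim().toLowerCase(Locale.ROOT).split("x", -1);
        if (parts.length == 1) {
            int side = Integer.parseInt(parts[0].trim());
            return new Dimension(side, side);
        }
        if (parts.length == 2) {
            return new Dimension(
                    Integer.parseInt(parts[0].trim()),
                    Integer.parseInt(parts[1].trim()));
        }
        throw new IllegalArgumentException(
                "Expected [width]x[height] or a single value: " + text);
    }

    public static void main(String[] args) {
        System.out.println(parse("800x600"));
        System.out.println(parse("300"));
    }
}
```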
The "validStatusCodes" and "notFoundStatusCodes" elements expect a comma-separated list of HTTP response codes. If a code is added in both elements, the valid list takes precedence.
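The precedence rule can be pictured with the following sketch. The class, the enum, and its constant names are hypothetical and exist only for this illustration. The one behavior taken from the text is that the valid list is checked first.

```java
import java.util.Arrays;
import java.util.List;

class StatusClassifier {

    // Hypothetical outcome names used only in this sketch.
    enum Outcome { VALID, NOT_FOUND, BAD_STATUS }

    static Outcome classify(
            int status, List<Integer> valid, List<Integer> notFound) {
        // The valid list is checked first, so it wins when a code
        // appears in both lists.
        if (valid.contains(status)) {
            return Outcome.VALID;
        }
        if (notFound.contains(status)) {
            return Outcome.NOT_FOUND;
        }
        return Outcome.BAD_STATUS;
    }

    public static void main(String[] args) {
        List<Integer> valid = Arrays.asList(200, 410);
        List<Integer> notFound = Arrays.asList(404, 410);
        System.out.println(classify(410, valid, notFound));
        System.out.println(classify(404, valid, notFound));
    }
}
```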
The following configures HTTP Collector to use PhantomJS, through a proxy that relies on HttpClient, only for URLs ending with ".html".
<httpcollector id="MyHttpCollector">
  ...
  <crawlers>
    <crawler id="MyCrawler">
      ...
      <documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
        <exePath>/path/to/phantomjs.exe</exePath>
        <renderWaitTime>5000</renderWaitTime>
        <referencePattern>^.*\.html$</referencePattern>
      </documentFetcher>
      ...
    </crawler>
  </crawlers>
  ...
  <!-- Only if you need to use the HttpClient proxy (see documentation): -->
  <collectorListeners>
    <listener class="com.norconex.collector.http.fetch.impl.HttpClientProxyCollectorListener" />
  </collectorListeners>
</httpcollector>
Modifier and Type | Class and Description |
---|---|
static class | PhantomJSDocumentFetcher.Quality Deprecated. |
static class | PhantomJSDocumentFetcher.Storage Deprecated. |
static class | PhantomJSDocumentFetcher.StorageDiskStructure Deprecated. |
Modifier and Type | Field and Description |
---|---|
static String | COLLECTOR_PHANTOMJS_SCREENSHOT_INLINE Deprecated. |
static String | COLLECTOR_PHANTOMJS_SCREENSHOT_PATH Deprecated. |
static String | DEFAULT_CONTENT_TYPE_PATTERN Deprecated. |
static int | DEFAULT_RENDER_WAIT_TIME Deprecated. |
static String | DEFAULT_SCREENSHOT_IMAGE_FORMAT Deprecated. |
static Dimension | DEFAULT_SCREENSHOT_SCALE_SIZE Deprecated. |
static PhantomJSDocumentFetcher.Storage | DEFAULT_SCREENSHOT_STORAGE Deprecated. |
static String | DEFAULT_SCREENSHOT_STORAGE_DISK_DIR Deprecated. |
static float | DEFAULT_SCREENSHOT_ZOOM_FACTOR Deprecated. |
static String | DEFAULT_SCRIPT_PATH Deprecated. |
Constructor and Description |
---|
PhantomJSDocumentFetcher() Deprecated. |
PhantomJSDocumentFetcher(int[] validStatusCodes) Deprecated. |
Modifier and Type | Method and Description |
---|---|
boolean |
accept(Doc doc,
HttpMethod httpMethod)
Deprecated.
|
protected boolean |
accept(HttpMethod httpMethod)
Deprecated.
Whether the supplied HttpMethod is supported by this fetcher.
|
boolean |
equals(Object other)
Deprecated.
|
IHttpFetchResponse |
fetch(CrawlDoc doc,
HttpMethod httpMethod)
Deprecated.
Performs an HTTP request for the supplied document reference
and HTTP method.
|
String |
getContentTypePattern()
Deprecated.
|
String |
getExePath()
Deprecated.
|
String |
getHeadersPrefix()
Deprecated.
|
List<Integer> |
getNotFoundStatusCodes()
Deprecated.
Gets HTTP status codes to be considered as "Not found" state.
|
List<String> |
getOptions()
Deprecated.
|
String |
getReferencePattern()
Deprecated.
|
int |
getRenderWaitTime()
Deprecated.
|
int |
getResourceTimeout()
Deprecated.
Gets the milliseconds timeout after which any resource requested will
stop trying and proceed with other parts of the page.
|
Dimension |
getScreenshotDimensions()
Deprecated.
|
String |
getScreenshotImageFormat()
Deprecated.
Gets the screenshot image format (jpg, png, gif, bmp, etc.).
|
Dimension |
getScreenshotScaleDimensions()
Deprecated.
Gets the pixel dimensions we want the stored screenshot to have.
|
PhantomJSDocumentFetcher.Quality |
getScreenshotScaleQuality()
Deprecated.
Gets the screenshot scaling quality to use when storage
is "disk" or "inline".
|
List<PhantomJSDocumentFetcher.Storage> |
getScreenshotStorage()
Deprecated.
Gets the screenshot storage mechanisms.
|
String |
getScreenshotStorageDiskDir()
Deprecated.
Gets the directory where screenshots are saved when storage is "disk".
|
String |
getScreenshotStorageDiskField()
Deprecated.
Gets the target document metadata field where to store the path
to the screenshot image file when storage is "disk".
|
PhantomJSDocumentFetcher.StorageDiskStructure |
getScreenshotStorageDiskStructure()
Deprecated.
Gets the screenshot directory structure to create when storage
is "disk".
|
String |
getScreenshotStorageInlineField()
Deprecated.
Gets the target document metadata field where to store the inline
(Base64) screenshot image when storage is "inline".
|
float |
getScreenshotZoomFactor()
Deprecated.
|
String |
getScriptPath()
Deprecated.
|
String |
getUserAgent()
Deprecated.
|
List<Integer> |
getValidExitCodes()
Deprecated.
Gets valid PhantomJS exit values (defaults to 0).
|
List<Integer> |
getValidStatusCodes()
Deprecated.
|
int |
hashCode()
Deprecated.
|
boolean |
isDetectCharset()
Deprecated.
|
boolean |
isDetectContentType()
Deprecated.
|
boolean |
isScreenshotEnabled()
Deprecated.
Gets whether to enable taking screenshot of crawled web pages.
|
boolean |
isScreenshotScaleStretch()
Deprecated.
Gets whether the screenshot should be stretched to fill all
the scale dimensions.
|
protected void |
loadHttpFetcherFromXML(XML xml)
Deprecated.
|
protected void |
saveHttpFetcherToXML(XML xml)
Deprecated.
|
void |
setContentTypePattern(String contentTypePattern)
Deprecated.
|
void |
setDetectCharset(boolean detectCharset)
Deprecated.
|
void |
setDetectContentType(boolean detectContentType)
Deprecated.
|
void |
setExePath(String exePath)
Deprecated.
|
void |
setHeadersPrefix(String headersPrefix)
Deprecated.
|
void |
setNotFoundStatusCodes(int... notFoundStatusCodes)
Deprecated.
Sets HTTP status codes to be considered as "Not found" state.
|
void |
setNotFoundStatusCodes(List<Integer> notFoundStatusCodes)
Deprecated.
Sets HTTP status codes to be considered as "Not found" state.
|
void |
setOptions(List<String> options)
Deprecated.
Sets optional extra PhantomJS command-line options.
|
void |
setOptions(String... options)
Deprecated.
Sets optional extra PhantomJS command-line options.
|
void |
setReferencePattern(String referencePattern)
Deprecated.
|
void |
setRenderWaitTime(int renderWaitTime)
Deprecated.
|
void |
setResourceTimeout(int resourceTimeout)
Deprecated.
Sets the milliseconds timeout after which any resource requested will
stop trying and proceed with other parts of the page.
|
void |
setScreenshotDimensions(Dimension screenshotDimensions)
Deprecated.
|
void |
setScreenshotDimensions(int width,
int height)
Deprecated.
|
void |
setScreenshotEnabled(boolean screenshotEnabled)
Deprecated.
Sets whether to enable taking screenshot of crawled web pages.
|
void |
setScreenshotImageFormat(String screenshotImageFormat)
Deprecated.
Sets the screenshot image format (jpg, png, gif, bmp, etc.).
|
void |
setScreenshotScaleDimensions(Dimension screenshotScaleDimensions)
Deprecated.
Sets the pixel dimensions we want the stored screenshot to have.
|
void |
setScreenshotScaleDimensions(int width,
int height)
Deprecated.
Sets the pixel dimensions we want the stored screenshot to have.
|
void |
setScreenshotScaleQuality(PhantomJSDocumentFetcher.Quality screenshotScaleQuality)
Deprecated.
Sets the screenshot scaling quality to use when storage
is "disk" or "inline".
|
void |
setScreenshotScaleStretch(boolean screenshotScaleStretch)
Deprecated.
Sets whether the screenshot should be stretched to fill all
the scale dimensions.
|
void |
setScreenshotStorage(List<PhantomJSDocumentFetcher.Storage> screenshotStorage)
Deprecated.
Sets the screenshot storage mechanisms.
|
void |
setScreenshotStorage(PhantomJSDocumentFetcher.Storage... screenshotStorage)
Deprecated.
Sets the screenshot storage mechanisms.
|
void |
setScreenshotStorageDiskDir(String screenshotStorageDiskDir)
Deprecated.
Sets the directory where screenshots are saved when storage is "disk".
|
void |
setScreenshotStorageDiskField(String screenshotStorageDiskField)
Deprecated.
Sets the target document metadata field where to store the path
to the screenshot image file when storage is "disk".
|
void |
setScreenshotStorageDiskStructure(PhantomJSDocumentFetcher.StorageDiskStructure screenshotStorageDiskStructure)
Deprecated.
Sets the screenshot directory structure to create when storage
is "disk".
|
void |
setScreenshotStorageInlineField(String screenshotStorageInlineField)
Deprecated.
Sets the target document metadata field where to store the inline
(Base64) screenshot image when storage is "inline".
|
void |
setScreenshotZoomFactor(float screenshotZoomFactor)
Deprecated.
|
void |
setScriptPath(String scriptPath)
Deprecated.
|
void |
setValidExitCodes(int... validExitCodes)
Deprecated.
Sets valid PhantomJS exit values (defaults to 0).
|
void |
setValidExitCodes(List<Integer> validExitCodes)
Deprecated.
Sets valid PhantomJS exit values (defaults to 0).
|
void |
setValidStatusCodes(int... validStatusCodes)
Deprecated.
Sets valid HTTP response status codes.
|
void |
setValidStatusCodes(List<Integer> validStatusCodes)
Deprecated.
Sets valid HTTP response status codes.
|
String |
toString()
Deprecated.
|
accept, fetcherShutdown, fetcherStartup, fetcherThreadBegin, fetcherThreadEnd, getReferenceFilters, loadFromXML, saveToXML, setReferenceFilters, setReferenceFilters
public static final String DEFAULT_SCRIPT_PATH
public static final int DEFAULT_RENDER_WAIT_TIME
public static final float DEFAULT_SCREENSHOT_ZOOM_FACTOR
public static final String DEFAULT_CONTENT_TYPE_PATTERN
public static final String DEFAULT_SCREENSHOT_STORAGE_DISK_DIR
public static final PhantomJSDocumentFetcher.Storage DEFAULT_SCREENSHOT_STORAGE
public static final String DEFAULT_SCREENSHOT_IMAGE_FORMAT
public static final Dimension DEFAULT_SCREENSHOT_SCALE_SIZE
public static final String COLLECTOR_PHANTOMJS_SCREENSHOT_PATH
public static final String COLLECTOR_PHANTOMJS_SCREENSHOT_INLINE
public PhantomJSDocumentFetcher()
public PhantomJSDocumentFetcher(int[] validStatusCodes)
public String getExePath()
public void setExePath(String exePath)
public String getScriptPath()
public void setScriptPath(String scriptPath)
public int getRenderWaitTime()
public void setRenderWaitTime(int renderWaitTime)
public void setOptions(List<String> options)
options - extra command line arguments
public void setOptions(String... options)
options - extra command line arguments
public String getScreenshotStorageDiskDir()
public void setScreenshotStorageDiskDir(String screenshotStorageDiskDir)
screenshotStorageDiskDir - directory
public String getScreenshotStorageDiskField()
public void setScreenshotStorageDiskField(String screenshotStorageDiskField)
screenshotStorageDiskField - field name
public String getScreenshotStorageInlineField()
public void setScreenshotStorageInlineField(String screenshotStorageInlineField)
screenshotStorageInlineField - field name
public boolean isScreenshotEnabled()
true if enabled
public void setScreenshotEnabled(boolean screenshotEnabled)
screenshotEnabled - true if enabled
public Dimension getScreenshotDimensions()
public void setScreenshotDimensions(int width, int height)
public void setScreenshotDimensions(Dimension screenshotDimensions)
public float getScreenshotZoomFactor()
public void setScreenshotZoomFactor(float screenshotZoomFactor)
public List<Integer> getValidExitCodes()
public void setValidExitCodes(List<Integer> validExitCodes)
validExitCodes - valid exit codes
public void setValidExitCodes(int... validExitCodes)
validExitCodes - valid exit codes
public void setValidStatusCodes(List<Integer> validStatusCodes)
validStatusCodes - valid status codes
public void setValidStatusCodes(int... validStatusCodes)
validStatusCodes - valid status codes
public List<Integer> getNotFoundStatusCodes()
public final void setNotFoundStatusCodes(int... notFoundStatusCodes)
notFoundStatusCodes - "Not found" codes
public final void setNotFoundStatusCodes(List<Integer> notFoundStatusCodes)
notFoundStatusCodes - "Not found" codes
public String getHeadersPrefix()
public void setHeadersPrefix(String headersPrefix)
public boolean isDetectContentType()
public void setDetectContentType(boolean detectContentType)
public boolean isDetectCharset()
public void setDetectCharset(boolean detectCharset)
public String getContentTypePattern()
public void setContentTypePattern(String contentTypePattern)
public String getReferencePattern()
public void setReferencePattern(String referencePattern)
public int getResourceTimeout()
-1 if undefined
public void setResourceTimeout(int resourceTimeout)
resourceTimeout - the timeout value, or -1 for undefined
public Dimension getScreenshotScaleDimensions()
public void setScreenshotScaleDimensions(Dimension screenshotScaleDimensions)
screenshotScaleDimensions - dimension
public void setScreenshotScaleDimensions(int width, int height)
width - image width
height - image height
public boolean isScreenshotScaleStretch()
true to stretch
public void setScreenshotScaleStretch(boolean screenshotScaleStretch)
screenshotScaleStretch - true to stretch
public String getScreenshotImageFormat()
public void setScreenshotImageFormat(String screenshotImageFormat)
screenshotImageFormat - image format
public List<PhantomJSDocumentFetcher.Storage> getScreenshotStorage()
null)
public void setScreenshotStorage(List<PhantomJSDocumentFetcher.Storage> screenshotStorage)
screenshotStorage - storage mechanisms
public void setScreenshotStorage(PhantomJSDocumentFetcher.Storage... screenshotStorage)
screenshotStorage - storage mechanisms
public PhantomJSDocumentFetcher.StorageDiskStructure getScreenshotStorageDiskStructure()
public void setScreenshotStorageDiskStructure(PhantomJSDocumentFetcher.StorageDiskStructure screenshotStorageDiskStructure)
screenshotStorageDiskStructure - directory structure
public PhantomJSDocumentFetcher.Quality getScreenshotScaleQuality()
PhantomJSDocumentFetcher.Quality.AUTO
public void setScreenshotScaleQuality(PhantomJSDocumentFetcher.Quality screenshotScaleQuality)
screenshotScaleQuality - quality
public String getUserAgent()
public boolean accept(Doc doc, HttpMethod httpMethod)
accept
in interface IHttpFetcher
accept
in class AbstractHttpFetcher
protected boolean accept(HttpMethod httpMethod)
AbstractHttpFetcher
accept
in class AbstractHttpFetcher
httpMethod - the HTTP method
true if supported
public IHttpFetchResponse fetch(CrawlDoc doc, HttpMethod httpMethod) throws HttpFetchException
IHttpFetcher
Performs an HTTP request for the supplied document reference and HTTP method.
For each HTTP method supported, implementors should
do their best to populate the document and its CrawlDocInfo
with as much information as they can.
Unsupported HTTP methods should return an HTTP response with the
CrawlState.UNSUPPORTED state. To prevent users from having to
configure multiple HTTP clients, implementors should try to support
both the GET and HEAD methods.
POST is only used in special cases and is often not used during a
crawl session.
A null method is treated as a GET.
doc - document to fetch or to use to make the request.
httpMethod - HTTP method
HttpFetchException - problem when fetching the document
HttpFetchResponseBuilder.unsupported()
protected void loadHttpFetcherFromXML(XML xml)
loadHttpFetcherFromXML
in class AbstractHttpFetcher
protected void saveHttpFetcherToXML(XML xml)
saveHttpFetcherToXML
in class AbstractHttpFetcher
public boolean equals(Object other)
equals
in class AbstractHttpFetcher
public int hashCode()
hashCode
in class AbstractHttpFetcher
public String toString()
toString
in class AbstractHttpFetcher
Copyright © 2009–2023 Norconex Inc. All rights reserved.