Class PhantomJSDocumentFetcher

java.lang.Object
com.norconex.collector.http.fetch.AbstractHttpFetcher
com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher
All Implemented Interfaces:
IHttpFetcher, IEventListener<Event>, IXMLConfigurable, EventListener, Consumer<Event>

@Deprecated public class PhantomJSDocumentFetcher extends AbstractHttpFetcher
Deprecated.
Since 3.0.0 use WebDriverHttpFetcher

Deprecation notice

The PhantomJS headless browser is no longer maintained by its owner. As such, starting with version 3.0.0, use of PhantomJSDocumentFetcher is strongly discouraged and HttpClientProxy support for it has been dropped. Now that more popular browsers (e.g., Chrome) support headless mode, more stable options are available. Consider using WebDriverHttpFetcher instead when crawling a JavaScript-driven website.


An alternative to the GenericHttpFetcher which relies on an external PhantomJS installation to fetch web pages. While less efficient, this implementation is meant to provide some way to crawl sites making heavy use of JavaScript to render their pages. This class tells the PhantomJS headless browser to wait a certain amount of time for the page to load extra content via Ajax requests before grabbing all loaded HTML.

Considerations

Relying on external software to fetch pages is slower, less scalable, and may be less stable. Prefer GenericHttpFetcher whenever possible. Use at your own risk. Use PhantomJS 2.1 (or possibly higher).

Handling of non-HTML Pages

PhantomJS is usually only useful for HTML pages with JavaScript. Other types of documents are fetched using an instance of GenericHttpFetcher. To find out whether it is dealing with an HTML document, this fetcher needs to know the content type first. By default, the content type of a document is not known before a physical copy is obtained. This means PhantomJS has to download the document first and, if it turns out not to be an HTML document, it is downloaded a second time with the generic document fetcher. By default, these content types are considered HTML:

 text/html, application/xhtml+xml, application/vnd.wap.xhtml+xml, application/x-asp
 

Those can be overridden with setContentTypePattern(String).
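
As a rough illustration of how a content-type pattern can decide whether a document counts as HTML, the sketch below matches a Content-Type value against a regular expression covering the four default types listed above. The class name, the regular expression, and the stripping of parameters such as the charset are assumptions made for this example; they are not the actual value of DEFAULT_CONTENT_TYPE_PATTERN nor the fetcher's implementation.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Norconex API.
class HtmlContentTypeMatcher {

    // Assumed pattern covering the default content types listed above.
    static final Pattern HTML_TYPES = Pattern.compile(
            "^(text/html|application/xhtml\\+xml"
            + "|application/vnd\\.wap\\.xhtml\\+xml|application/x-asp)$");

    // Returns true when the content type (minus parameters) matches.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before matching.
        String type = contentType.split(";", 2)[0]
                .trim().toLowerCase(Locale.ROOT);
        return HTML_TYPES.matcher(type).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8"));
        System.out.println(isHtml("application/pdf"));
    }
}
```

A custom expression passed to setContentTypePattern(String) would play the role of HTML_TYPES in this sketch.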

Avoid double-downloads

To avoid downloading the document twice as described above, you can configure a metadata fetcher (such as GenericHttpFetcher). This will attempt to get the content type by first making an HTTP HEAD request.

Alternatively, if you have a URL pattern that identifies your HTML pages (and only HTML pages), you can specify it using setReferencePattern(String). Only URLs matching the provided regular expression will be fetched by PhantomJS. By default there is no pattern for discriminating on URL references.
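
The URL-based selection can be pictured with the following minimal sketch, which uses the same expression as the usage example further down (URLs ending with ".html"). The class and method names are hypothetical, and both the full-match semantics and the treatment of a missing pattern as "no filtering on URL" are assumptions, not confirmed behavior of this fetcher.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Norconex API.
class ReferencePatternRouter {

    // Returns true when the URL is a candidate for PhantomJS.
    // Assumption: a null or empty pattern means URLs are not filtered.
    static boolean usePhantomJS(String url, String referencePattern) {
        if (referencePattern == null || referencePattern.isEmpty()) {
            return true;
        }
        // Assumption: the whole URL must match the expression.
        return Pattern.matches(referencePattern, url);
    }

    public static void main(String[] args) {
        String pattern = "^.*\\.html$";
        System.out.println(
                usePhantomJS("https://example.com/page.html", pattern));
        System.out.println(
                usePhantomJS("https://example.com/report.pdf", pattern));
    }
}
```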

Taking screenshots of pages

Thanks to PhantomJS, one can save images of pages being crawled, including those rendered with JavaScript!

Since 2.8.0, you have to explicitly enable screenshots with setScreenshotEnabled(boolean). Also, screenshots now share the same size by default. In addition, you can now control how screenshots are resized and how they are stored. Storage options:

  • inline: Stores a Base64 string of the scaled image, in the format specified, in a collector.featured-image-inline field. The string is ready to be used inline, in a <img src="..."> tag.
  • disk: Stores the scaled image on the file system, in the format and directory specified. A reference to the file on disk is stored in a collector.featured-image-path field.
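
To make the "inline" option more concrete, here is a minimal sketch of turning image bytes into a Base64 string usable in an img tag. The class and method names are hypothetical, and whether the stored field value already carries the "data:image/...;base64," prefix is an assumption; check an actual field value before relying on it.

```java
import java.util.Base64;

// Hypothetical helper for illustration only; not part of the Norconex API.
class InlineImage {

    // Builds a data URI from raw image bytes and a format such as "png".
    static String toDataUri(byte[] imageBytes, String format) {
        String base64 = Base64.getEncoder().encodeToString(imageBytes);
        return "data:image/" + format + ";base64," + base64;
    }

    public static void main(String[] args) {
        byte[] placeholder = {1, 2, 3}; // placeholder bytes, not a real image
        String uri = toDataUri(placeholder, "png");
        System.out.println("<img src=\"" + uri + "\">");
    }
}
```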

Since 2.8.0, it is possible to specify a resource timeout so that slow individual page resources do not cause PhantomJS to hang for a long time.

PhantomJS exit values

Since 2.9.1, it is possible to specify which PhantomJS exit values are to be considered "valid". Provide them as a comma-separated list of integers in the XML configuration, or with the setValidExitCodes(int...) method. By default, only zero is considered valid.
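
The exit-value check can be sketched as follows: parse a comma-separated list, fall back to zero when nothing is given, and test a process exit value against the result. The class and method names are hypothetical and this is only an illustration of the idea, not the fetcher's own parsing code.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Norconex API.
class ExitCodes {

    // Parses a comma-separated list such as "0, 2"; falls back to {0}.
    static Set<Integer> parse(String csv) {
        Set<Integer> codes = new LinkedHashSet<>();
        if (csv != null) {
            for (String token : csv.split(",")) {
                String trimmed = token.trim();
                if (!trimmed.isEmpty()) {
                    codes.add(Integer.parseInt(trimmed));
                }
            }
        }
        if (codes.isEmpty()) {
            codes.add(0); // documented default: only zero is valid
        }
        return codes;
    }

    // True when the process exit value is one of the accepted codes.
    static boolean isValid(int exitValue, Set<Integer> validCodes) {
        return validCodes.contains(exitValue);
    }

    public static void main(String[] args) {
        Set<Integer> valid = parse("0, 2");
        System.out.println(isValid(2, valid));
        System.out.println(isValid(1, valid));
    }
}
```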

XML configuration entries expecting millisecond durations can be provided in human-readable format (English only), as per DurationParser (e.g., "5 minutes and 30 seconds" or "5m30s").
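
To show what such a duration amounts to in milliseconds, the sketch below converts plain millisecond values and compact forms such as "5m30s". It is a deliberately simplified stand-in with hypothetical names, not DurationParser itself, and it does not handle spelled-out units such as "5 minutes and 30 seconds".

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in; this is not DurationParser.
class SimpleDurations {

    // A number followed by a compact unit: ms, h, m, or s.
    private static final Pattern PART =
            Pattern.compile("(\\d+)\\s*(ms|h|m|s)");

    // Converts plain milliseconds ("3000") or compact text ("5m30s").
    static long toMillis(String text) {
        String value = text.trim().toLowerCase(Locale.ROOT);
        if (value.matches("\\d+")) {
            return Long.parseLong(value); // already in milliseconds
        }
        long total = 0;
        Matcher m = PART.matcher(value);
        while (m.find()) {
            long amount = Long.parseLong(m.group(1));
            switch (m.group(2)) {
                case "h": total += amount * 3_600_000L; break;
                case "m": total += amount * 60_000L; break;
                case "s": total += amount * 1_000L; break;
                default:  total += amount; break; // "ms"
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(toMillis("5m30s"));
        System.out.println(toMillis("3000"));
    }
}
```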

XML configuration usage:

  <documentFetcher
      class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher"
      detectContentType="[false|true]" detectCharset="[false|true]"
      screenshotEnabled="[false|true]">
      <exePath>(path to PhantomJS executable)</exePath>
      <scriptPath>
          (Optional path to a PhantomJS script. Defaults to scripts/phantom.js)
      </scriptPath>
      <renderWaitTime>
          (Milliseconds to wait for the entire page to load.
           Defaults to 3000, i.e., 3 seconds.)
      </renderWaitTime>
      <resourceTimeout>
          (Optional milliseconds to wait for a page resource to load.
           Default is unspecified.)
      </resourceTimeout>
      <options>
        <opt>(optional extra PhantomJS command-line option)</opt>
        <!-- You can have multiple opt tags -->
      </options>
      <referencePattern>
          (Regular expression matching URLs for which to use the
           PhantomJS browser. Non-matching URLs will fall back
           to using GenericDocumentFetcher.)
      </referencePattern>
      <contentTypePattern>
          (Regular expression matching content types for which to use
           the PhantomJS browser. Non-matching content types will use
           the GenericDocumentFetcher.)
      </contentTypePattern>
      <validExitCodes>(defaults to 0)</validExitCodes>
      <validStatusCodes>(defaults to 200)</validStatusCodes>
      <notFoundStatusCodes>(defaults to 404)</notFoundStatusCodes>
      <headersPrefix>(string to prefix headers)</headersPrefix>

      <!-- Only applicable when screenshotEnabled is true: -->
      <screenshotDimensions>
          (Pixel size of the browser page area to capture: [width]x[height].
           E.g., 800x600.  Only used when a screenshot path is specified.
           Default is undefined. It will try to load all it can and may
           produce vertically long images.)
      </screenshotDimensions>
      <screenshotZoomFactor>
          (A decimal value to scale the screenshot image.
           E.g., 0.25  will make the image 25% its regular size,
           which is 25% of the above dimension if specified.
           Default is 1, i.e., 100%)
      </screenshotZoomFactor>
      <screenshotScaleDimensions>
         (Target pixel size the main image should be scaled to.
          Default is 300.)
      </screenshotScaleDimensions>
      <screenshotScaleStretch>
         [false|true]
         (Whether to stretch to match scale size. Default keeps aspect ratio.)
      </screenshotScaleStretch>
      <screenshotScaleQuality>
          [auto|low|medium|high|max]
          (Default is "auto", which tries the best balance between quality
           and speed based on image size. The lower the quality the faster
           it is to scale images.)
      </screenshotScaleQuality>
      <screenshotImageFormat>
         (Target format of stored image. E.g., "jpg", "png", "gif", "bmp", ...
          Default is "png")
      </screenshotImageFormat>
      <screenshotStorage>
         [disk|inline]
         (One or both, comma-separated. Default is "disk".)
      </screenshotStorage>

      <!-- Only applicable for "disk" storage: -->
      <screenshotStorageDiskDir structure="[url2path|date|datetime]">
          (Path where to save screenshots.)
      </screenshotStorageDiskDir>
      <screenshotStorageDiskField>
          (Overwrite default field where to store the screenshot path.)
      </screenshotStorageDiskField>

      <!-- Only applicable for "inline" storage: -->
      <screenshotStorageInlineField>
          (Overwrite default field where to store the inline screenshot.)
      </screenshotStorageInlineField>

  </documentFetcher>
 

When specifying an image size, the format is [width]x[height] or a single value. When a single value is used, that value represents both the width and height (i.e., a square).
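
The size format described above can be parsed along these lines. The class and method names are hypothetical, and java.awt.Dimension is used here only as a convenient width/height holder; this is a sketch, not the fetcher's own parsing code.

```java
import java.awt.Dimension;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the Norconex API.
class ImageSizes {

    // Parses "[width]x[height]", or a single value meaning a square.
    static Dimension parse(String text) {
        String[] parts = text.trim().toLowerCase(Locale.ROOT).split("x");
        int width = Integer.parseInt(parts[0].trim());
        // A single value is used for both the width and the height.
        int height = parts.length > 1
                ? Integer.parseInt(parts[1].trim())
                : width;
        return new Dimension(width, height);
    }

    public static void main(String[] args) {
        Dimension rect = parse("800x600");
        Dimension square = parse("300");
        System.out.println(rect.width + " by " + rect.height);
        System.out.println(square.width + " by " + square.height);
    }
}
```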

The "validStatusCodes" and "notFoundStatusCodes" elements expect a comma-separated list of HTTP response codes. If a code is added to both elements, the valid list takes precedence.
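
The precedence rule can be sketched as a simple ordered check: the valid list is consulted first, so a code present in both lists is treated as valid. The class, enum, and constant names below are hypothetical and do not correspond to the crawler's actual state types.

```java
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Norconex API.
class StatusClassifier {

    // Assumed outcome names; the crawler uses its own state types.
    enum Outcome { VALID, NOT_FOUND, BAD_STATUS }

    static Outcome classify(
            int status, Set<Integer> valid, Set<Integer> notFound) {
        if (valid.contains(status)) {
            return Outcome.VALID; // the valid list takes precedence
        }
        if (notFound.contains(status)) {
            return Outcome.NOT_FOUND;
        }
        return Outcome.BAD_STATUS;
    }

    public static void main(String[] args) {
        Set<Integer> valid = Set.of(200, 404);
        Set<Integer> notFound = Set.of(404);
        // 404 is in both lists, so it is classified as valid here.
        System.out.println(classify(404, valid, notFound));
        System.out.println(classify(500, valid, notFound));
    }
}
```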

Usage example:

The following configures HTTP Collector to use PhantomJS only for URLs ending with ".html", with an optional listener for cases where the HttpClient proxy is needed.

  <httpcollector id="MyHttpCollector">
    ...
    <crawlers>
      <crawler id="MyCrawler">
        ...
        <documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
          <exePath>/path/to/phantomjs.exe</exePath>
          <renderWaitTime>5000</renderWaitTime>
          <referencePattern>^.*\.html$</referencePattern>
        </documentFetcher>
        ...
      </crawler>
    </crawlers>
    ...
    <!-- Only if you need to use the HttpClient proxy (see documentation): -->
    <collectorListeners>
      <listener class="com.norconex.collector.http.fetch.impl.HttpClientProxyCollectorListener" />
    </collectorListeners>
  </httpcollector>
 
Since:
2.7.0
Author:
Pascal Essiembre
  • Field Details

    • DEFAULT_SCRIPT_PATH

      public static final String DEFAULT_SCRIPT_PATH
      Deprecated.
    • DEFAULT_RENDER_WAIT_TIME

      public static final int DEFAULT_RENDER_WAIT_TIME
      Deprecated.
    • DEFAULT_SCREENSHOT_ZOOM_FACTOR

      public static final float DEFAULT_SCREENSHOT_ZOOM_FACTOR
      Deprecated.
    • DEFAULT_CONTENT_TYPE_PATTERN

      public static final String DEFAULT_CONTENT_TYPE_PATTERN
      Deprecated.
    • DEFAULT_SCREENSHOT_STORAGE_DISK_DIR

      public static final String DEFAULT_SCREENSHOT_STORAGE_DISK_DIR
      Deprecated.
    • DEFAULT_SCREENSHOT_STORAGE

      public static final PhantomJSDocumentFetcher.Storage DEFAULT_SCREENSHOT_STORAGE
      Deprecated.
    • DEFAULT_SCREENSHOT_IMAGE_FORMAT

      public static final String DEFAULT_SCREENSHOT_IMAGE_FORMAT
      Deprecated.
    • DEFAULT_SCREENSHOT_SCALE_SIZE

      public static final Dimension DEFAULT_SCREENSHOT_SCALE_SIZE
      Deprecated.
    • COLLECTOR_PHANTOMJS_SCREENSHOT_PATH

      public static final String COLLECTOR_PHANTOMJS_SCREENSHOT_PATH
      Deprecated.
    • COLLECTOR_PHANTOMJS_SCREENSHOT_INLINE

      public static final String COLLECTOR_PHANTOMJS_SCREENSHOT_INLINE
      Deprecated.
  • Constructor Details

    • PhantomJSDocumentFetcher

      public PhantomJSDocumentFetcher()
      Deprecated.
    • PhantomJSDocumentFetcher

      public PhantomJSDocumentFetcher(int[] validStatusCodes)
      Deprecated.
  • Method Details

    • getExePath

      public String getExePath()
      Deprecated.
    • setExePath

      public void setExePath(String exePath)
      Deprecated.
    • getScriptPath

      public String getScriptPath()
      Deprecated.
    • setScriptPath

      public void setScriptPath(String scriptPath)
      Deprecated.
    • getRenderWaitTime

      public int getRenderWaitTime()
      Deprecated.
    • setRenderWaitTime

      public void setRenderWaitTime(int renderWaitTime)
      Deprecated.
    • getOptions

      public List<String> getOptions()
      Deprecated.
    • setOptions

      public void setOptions(List<String> options)
      Deprecated.
      Sets optional extra PhantomJS command-line options.
      Parameters:
      options - extra command line arguments
      Since:
      3.0.0
    • setOptions

      public void setOptions(String... options)
      Deprecated.
      Sets optional extra PhantomJS command-line options.
      Parameters:
      options - extra command line arguments
    • getScreenshotStorageDiskDir

      public String getScreenshotStorageDiskDir()
      Deprecated.
      Gets the directory where screenshots are saved when storage is "disk". Default is "./screenshots".
      Returns:
      directory
      Since:
      2.8.0
    • setScreenshotStorageDiskDir

      public void setScreenshotStorageDiskDir(String screenshotStorageDiskDir)
      Deprecated.
      Sets the directory where screenshots are saved when storage is "disk". Use this method to overwrite the default ("./screenshots").
      Parameters:
      screenshotStorageDiskDir - directory
      Since:
      2.8.0
    • getScreenshotStorageDiskField

      public String getScreenshotStorageDiskField()
      Deprecated.
      Gets the target document metadata field in which to store the path to the screenshot image file when storage is "disk". Default is "collector.phantomjs-screenshot-path".
      Returns:
      field name
      Since:
      2.8.0
    • setScreenshotStorageDiskField

      public void setScreenshotStorageDiskField(String screenshotStorageDiskField)
      Deprecated.
      Sets the target document metadata field in which to store the path to the screenshot image file when storage is "disk". Use this method to overwrite the default ("collector.phantomjs-screenshot-path").
      Parameters:
      screenshotStorageDiskField - field name
      Since:
      2.8.0
    • getScreenshotStorageInlineField

      public String getScreenshotStorageInlineField()
      Deprecated.
      Gets the target document metadata field where to store the inline (Base64) screenshot image when storage is "inline". Default is "collector.phantomjs-screenshot-inline".
      Returns:
      field name
      Since:
      2.8.0
    • setScreenshotStorageInlineField

      public void setScreenshotStorageInlineField(String screenshotStorageInlineField)
      Deprecated.
      Sets the target document metadata field where to store the inline (Base64) screenshot image when storage is "inline". Use this method to overwrite the default ("collector.phantomjs-screenshot-inline").
      Parameters:
      screenshotStorageInlineField - field name
      Since:
      2.8.0
    • isScreenshotEnabled

      public boolean isScreenshotEnabled()
      Deprecated.
      Gets whether taking screenshots of crawled web pages is enabled.
      Returns:
      true if enabled
      Since:
      2.8.0
    • setScreenshotEnabled

      public void setScreenshotEnabled(boolean screenshotEnabled)
      Deprecated.
      Sets whether to enable taking screenshots of crawled web pages.
      Parameters:
      screenshotEnabled - true if enabled
      Since:
      2.8.0
    • getScreenshotDimensions

      public Dimension getScreenshotDimensions()
      Deprecated.
    • setScreenshotDimensions

      public void setScreenshotDimensions(int width, int height)
      Deprecated.
    • setScreenshotDimensions

      public void setScreenshotDimensions(Dimension screenshotDimensions)
      Deprecated.
    • getScreenshotZoomFactor

      public float getScreenshotZoomFactor()
      Deprecated.
    • setScreenshotZoomFactor

      public void setScreenshotZoomFactor(float screenshotZoomFactor)
      Deprecated.
    • getValidExitCodes

      public List<Integer> getValidExitCodes()
      Deprecated.
      Gets valid PhantomJS exit values (defaults to 0).
      Returns:
      valid exit codes
      Since:
      2.9.1
    • setValidExitCodes

      public void setValidExitCodes(List<Integer> validExitCodes)
      Deprecated.
      Sets valid PhantomJS exit values (defaults to 0).
      Parameters:
      validExitCodes - valid exit codes
      Since:
      2.9.1
    • setValidExitCodes

      public void setValidExitCodes(int... validExitCodes)
      Deprecated.
      Sets valid PhantomJS exit values (defaults to 0).
      Parameters:
      validExitCodes - valid exit codes
      Since:
      2.9.1
    • getValidStatusCodes

      public List<Integer> getValidStatusCodes()
      Deprecated.
    • setValidStatusCodes

      public void setValidStatusCodes(List<Integer> validStatusCodes)
      Deprecated.
      Sets valid HTTP response status codes.
      Parameters:
      validStatusCodes - valid status codes
      Since:
      3.0.0
    • setValidStatusCodes

      public void setValidStatusCodes(int... validStatusCodes)
      Deprecated.
      Sets valid HTTP response status codes.
      Parameters:
      validStatusCodes - valid status codes
    • getNotFoundStatusCodes

      public List<Integer> getNotFoundStatusCodes()
      Deprecated.
      Gets HTTP status codes to be considered as "Not found" state. Default is 404.
      Returns:
      "Not found" codes
    • setNotFoundStatusCodes

      public final void setNotFoundStatusCodes(int... notFoundStatusCodes)
      Deprecated.
      Sets HTTP status codes to be considered as "Not found" state.
      Parameters:
      notFoundStatusCodes - "Not found" codes
    • setNotFoundStatusCodes

      public final void setNotFoundStatusCodes(List<Integer> notFoundStatusCodes)
      Deprecated.
      Sets HTTP status codes to be considered as "Not found" state.
      Parameters:
      notFoundStatusCodes - "Not found" codes
      Since:
      3.0.0
    • getHeadersPrefix

      public String getHeadersPrefix()
      Deprecated.
    • setHeadersPrefix

      public void setHeadersPrefix(String headersPrefix)
      Deprecated.
    • isDetectContentType

      public boolean isDetectContentType()
      Deprecated.
    • setDetectContentType

      public void setDetectContentType(boolean detectContentType)
      Deprecated.
    • isDetectCharset

      public boolean isDetectCharset()
      Deprecated.
    • setDetectCharset

      public void setDetectCharset(boolean detectCharset)
      Deprecated.
    • getContentTypePattern

      public String getContentTypePattern()
      Deprecated.
    • setContentTypePattern

      public void setContentTypePattern(String contentTypePattern)
      Deprecated.
    • getReferencePattern

      public String getReferencePattern()
      Deprecated.
    • setReferencePattern

      public void setReferencePattern(String referencePattern)
      Deprecated.
    • getResourceTimeout

      public int getResourceTimeout()
      Deprecated.
      Gets the timeout, in milliseconds, after which loading of any requested resource is abandoned so that the rest of the page can proceed.
      Returns:
      the timeout value, or -1 if undefined
      Since:
      2.8.0
    • setResourceTimeout

      public void setResourceTimeout(int resourceTimeout)
      Deprecated.
      Sets the timeout, in milliseconds, after which loading of any requested resource is abandoned so that the rest of the page can proceed.
      Parameters:
      resourceTimeout - the timeout value, or -1 for undefined
      Since:
      2.8.0
    • getScreenshotScaleDimensions

      public Dimension getScreenshotScaleDimensions()
      Deprecated.
      Gets the pixel dimensions we want the stored screenshot to have.
      Returns:
      dimension
      Since:
      2.8.0
    • setScreenshotScaleDimensions

      public void setScreenshotScaleDimensions(Dimension screenshotScaleDimensions)
      Deprecated.
      Sets the pixel dimensions we want the stored screenshot to have.
      Parameters:
      screenshotScaleDimensions - dimension
      Since:
      2.8.0
    • setScreenshotScaleDimensions

      public void setScreenshotScaleDimensions(int width, int height)
      Deprecated.
      Sets the pixel dimensions we want the stored screenshot to have.
      Parameters:
      width - image width
      height - image height
      Since:
      2.8.0
    • isScreenshotScaleStretch

      public boolean isScreenshotScaleStretch()
      Deprecated.
      Gets whether the screenshot should be stretched to fill all the scale dimensions. Default keeps aspect ratio.
      Returns:
      true to stretch
      Since:
      2.8.0
    • setScreenshotScaleStretch

      public void setScreenshotScaleStretch(boolean screenshotScaleStretch)
      Deprecated.
      Sets whether the screenshot should be stretched to fill all the scale dimensions. Default keeps aspect ratio.
      Parameters:
      screenshotScaleStretch - true to stretch
      Since:
      2.8.0
    • getScreenshotImageFormat

      public String getScreenshotImageFormat()
      Deprecated.
      Gets the screenshot image format (jpg, png, gif, bmp, etc.).
      Returns:
      image format
      Since:
      2.8.0
    • setScreenshotImageFormat

      public void setScreenshotImageFormat(String screenshotImageFormat)
      Deprecated.
      Sets the screenshot image format (jpg, png, gif, bmp, etc.).
      Parameters:
      screenshotImageFormat - image format
      Since:
      2.8.0
    • getScreenshotStorage

      public List<PhantomJSDocumentFetcher.Storage> getScreenshotStorage()
      Deprecated.
      Gets the screenshot storage mechanisms.
      Returns:
      storage mechanisms (never null)
      Since:
      2.8.0
    • setScreenshotStorage

      public void setScreenshotStorage(List<PhantomJSDocumentFetcher.Storage> screenshotStorage)
      Deprecated.
      Sets the screenshot storage mechanisms.
      Parameters:
      screenshotStorage - storage mechanisms
      Since:
      3.0.0
    • setScreenshotStorage

      public void setScreenshotStorage(PhantomJSDocumentFetcher.Storage... screenshotStorage)
      Deprecated.
      Sets the screenshot storage mechanisms.
      Parameters:
      screenshotStorage - storage mechanisms
      Since:
      2.8.0
    • getScreenshotStorageDiskStructure

      public PhantomJSDocumentFetcher.StorageDiskStructure getScreenshotStorageDiskStructure()
      Deprecated.
      Gets the screenshot directory structure to create when storage is "disk".
      Returns:
      directory structure
      Since:
      2.8.0
    • setScreenshotStorageDiskStructure

      public void setScreenshotStorageDiskStructure(PhantomJSDocumentFetcher.StorageDiskStructure screenshotStorageDiskStructure)
      Deprecated.
      Sets the screenshot directory structure to create when storage is "disk".
      Parameters:
      screenshotStorageDiskStructure - directory structure
      Since:
      2.8.0
    • getScreenshotScaleQuality

      public PhantomJSDocumentFetcher.Quality getScreenshotScaleQuality()
      Deprecated.
      Gets the screenshot scaling quality to use when storage is "disk" or "inline". Default is PhantomJSDocumentFetcher.Quality.AUTO.
      Returns:
      quality
      Since:
      2.8.0
    • setScreenshotScaleQuality

      public void setScreenshotScaleQuality(PhantomJSDocumentFetcher.Quality screenshotScaleQuality)
      Deprecated.
      Sets the screenshot scaling quality to use when storage is "disk" or "inline".
      Parameters:
      screenshotScaleQuality - quality
      Since:
      2.8.0
    • getUserAgent

      public String getUserAgent()
      Deprecated.
    • accept

      public boolean accept(Doc doc, HttpMethod httpMethod)
      Deprecated.
      Specified by:
      accept in interface IHttpFetcher
      Overrides:
      accept in class AbstractHttpFetcher
    • accept

      protected boolean accept(HttpMethod httpMethod)
      Deprecated.
      Description copied from class: AbstractHttpFetcher
      Whether the supplied HttpMethod is supported by this fetcher.
      Specified by:
      accept in class AbstractHttpFetcher
      Parameters:
      httpMethod - the HTTP method
      Returns:
      true if supported
    • fetch

      public IHttpFetchResponse fetch(CrawlDoc doc, HttpMethod httpMethod) throws HttpFetchException
      Deprecated.
      Description copied from interface: IHttpFetcher

      Performs an HTTP request for the supplied document reference and HTTP method.

      For each HTTP method supported, implementors should do their best to populate the document and its CrawlDocInfo with as much information as they can.

      Unsupported HTTP methods should return an HTTP response with the CrawlState.UNSUPPORTED state. To spare users from having to configure multiple HTTP clients, implementors should try to support both the GET and HEAD methods. POST is only used in special cases and is often not used during a crawl session.

      A null method is treated as a GET.

      Parameters:
      doc - document to fetch or to use to make the request.
      httpMethod - HTTP method
      Returns:
      an HTTP response
      Throws:
      HttpFetchException - problem when fetching the document
    • loadHttpFetcherFromXML

      protected void loadHttpFetcherFromXML(XML xml)
      Deprecated.
      Specified by:
      loadHttpFetcherFromXML in class AbstractHttpFetcher
    • saveHttpFetcherToXML

      protected void saveHttpFetcherToXML(XML xml)
      Deprecated.
      Specified by:
      saveHttpFetcherToXML in class AbstractHttpFetcher
    • equals

      public boolean equals(Object other)
      Deprecated.
      Overrides:
      equals in class AbstractHttpFetcher
    • hashCode

      public int hashCode()
      Deprecated.
      Overrides:
      hashCode in class AbstractHttpFetcher
    • toString

      public String toString()
      Deprecated.
      Overrides:
      toString in class AbstractHttpFetcher