Class AbstractHttpFetcher

  • All Implemented Interfaces:
    IHttpFetcher, IEventListener<Event>, IXMLConfigurable, EventListener, Consumer<Event>
    Direct Known Subclasses:
    GenericHttpFetcher, PhantomJSDocumentFetcher, WebDriverHttpFetcher

    public abstract class AbstractHttpFetcher
    extends Object
    implements IHttpFetcher, IXMLConfigurable, IEventListener<Event>

    Base class that implements the accept(Doc, HttpMethod) method. It uses reference filters to determine whether this fetcher accepts a URL, and it delegates the HTTP method check to its own abstract accept(HttpMethod) method. It also offers methods that subclasses can override to react to crawler startup and shutdown events.
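
    The delegation described above can be sketched with simplified stand-in types. This is only an illustration, not the library's code: Method, RefFilter, BaseFetcher and GetHeadFetcher are hypothetical names, a plain String replaces Doc, and both the order of the two checks and the rule that every filter must accept the reference are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AcceptSketch {

    // Hypothetical, reduced stand-in for the library's HttpMethod.
    enum Method { GET, HEAD, POST }

    // Hypothetical stand-in for IReferenceFilter.
    interface RefFilter {
        boolean acceptReference(String reference);
    }

    // Simplified model of the base class: the two-argument accept is
    // concrete, the one-argument accept is left to subclasses.
    abstract static class BaseFetcher {
        private final List<RefFilter> filters = new ArrayList<>();

        void setReferenceFilters(RefFilter... referenceFilters) {
            filters.clear();
            filters.addAll(Arrays.asList(referenceFilters));
        }

        // Assumption: the method check runs first, then every filter
        // must accept the reference.
        final boolean accept(String reference, Method method) {
            if (!accept(method)) {
                return false;
            }
            for (RefFilter filter : filters) {
                if (!filter.acceptReference(reference)) {
                    return false;
                }
            }
            return true;
        }

        // Counterpart of the abstract accept(HttpMethod).
        protected abstract boolean accept(Method method);
    }

    // Example subclass that only supports GET and HEAD.
    static class GetHeadFetcher extends BaseFetcher {
        @Override
        protected boolean accept(Method method) {
            return method == Method.GET || method == Method.HEAD;
        }
    }

    public static void main(String[] args) {
        GetHeadFetcher fetcher = new GetHeadFetcher();
        fetcher.setReferenceFilters(ref -> ref.endsWith(".html"));
        // prints true: method supported and filter accepts
        System.out.println(fetcher.accept("https://example.com/a.html", Method.GET));
        // prints false: rejected by the reference filter
        System.out.println(fetcher.accept("https://example.com/a.pdf", Method.GET));
        // prints false: POST is not supported by this subclass
        System.out.println(fetcher.accept("https://example.com/a.html", Method.POST));
    }
}
```

    With this split, a subclass only has to state which HTTP methods it supports; the reference filtering stays in one place in the base class.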

    XML configuration usage:

    Subclasses inherit this IXMLConfigurable configuration:

    
    <referenceFilters>
      <!-- multiple "filter" tags allowed -->
      <filter
          class="(any reference filter class)">
        (Restrict usage of this fetcher to matching reference filters.
         Refer to the documentation for the IReferenceFilter implementation
         you are using here for usage details.)
      </filter>
    </referenceFilters>

    Usage example:

    This filter example prevents an HTTP fetcher from being applied to URLs under "https://example.com/pdfs/".

    
    <referenceFilters>
      <filter
          class="ReferenceFilter"
          onMatch="exclude">
        <valueMatcher
            method="regex">
          https://example\.com/pdfs/.*
        </valueMatcher>
      </filter>
    </referenceFilters>
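
    The effect of an "exclude" regex filter such as the one above can be approximated in plain Java. This is a rough sketch, not the ReferenceFilter implementation: acceptedByExcludeFilter is a hypothetical helper, and it assumes the regex must match the whole reference.

```java
import java.util.regex.Pattern;

class ExcludeFilterSketch {

    // Hypothetical helper: with onMatch="exclude", a reference that
    // matches the regex is rejected; anything else is accepted.
    static boolean acceptedByExcludeFilter(String regex, String reference) {
        boolean matches = Pattern.compile(regex).matcher(reference).matches();
        return !matches;
    }

    public static void main(String[] args) {
        String regex = "https://example\\.com/pdfs/.*";
        // prints false: the URL is under /pdfs/, so the fetcher skips it
        System.out.println(acceptedByExcludeFilter(
                regex, "https://example.com/pdfs/report.pdf"));
        // prints true: the URL does not match, so the fetcher may handle it
        System.out.println(acceptedByExcludeFilter(
                regex, "https://example.com/index.html"));
    }
}
```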
    Since:
    3.0.0
    Author:
    Pascal Essiembre
    • Constructor Detail

      • AbstractHttpFetcher

        public AbstractHttpFetcher()
    • Method Detail

      • getReferenceFilters

        public List<IReferenceFilter> getReferenceFilters()
        Gets reference filters.
        Returns:
        reference filters
      • setReferenceFilters

        public void setReferenceFilters(IReferenceFilter... referenceFilters)
        Sets reference filters.
        Parameters:
        referenceFilters - reference filters to set
      • setReferenceFilters

        public void setReferenceFilters(List<IReferenceFilter> referenceFilters)
        Sets reference filters.
        Parameters:
        referenceFilters - reference filters to set
      • accept

        protected abstract boolean accept(HttpMethod httpMethod)
        Whether the supplied HttpMethod is supported by this fetcher.
        Parameters:
        httpMethod - the HTTP method
        Returns:
        true if supported
      • fetcherStartup

        protected void fetcherStartup(HttpCollector collector)
        Invoked once per fetcher instance, when the collector starts. Default implementation does nothing.
        Parameters:
        collector - collector
      • fetcherShutdown

        protected void fetcherShutdown(HttpCollector collector)
        Invoked once per fetcher instance, when the collector ends. Default implementation does nothing.
        Parameters:
        collector - collector
      • fetcherThreadBegin

        protected void fetcherThreadBegin(HttpCrawler crawler)
        Invoked each time a crawler begins a new crawler thread, provided that thread is the current thread. Default implementation does nothing.
        Parameters:
        crawler - crawler
      • fetcherThreadEnd

        protected void fetcherThreadEnd(HttpCrawler crawler)
        Invoked each time a crawler ends an existing crawler thread, provided that thread is the current thread. Default implementation does nothing.
        Parameters:
        crawler - crawler
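
        The four lifecycle hooks above follow the same pattern: a no-op default that a subclass overrides only where it needs to. The sketch below illustrates that pattern with stand-in types. BaseFetcher, RecordingFetcher and simulateRun are hypothetical names, plain strings replace HttpCollector and HttpCrawler, and the hooks are called directly by a driver rather than by the collector.

```java
import java.util.ArrayList;
import java.util.List;

class LifecycleSketch {

    // Stand-in for the base class: every hook does nothing by default.
    abstract static class BaseFetcher {
        protected void fetcherStartup(String collectorId) { }
        protected void fetcherShutdown(String collectorId) { }
        protected void fetcherThreadBegin(String crawlerId) { }
        protected void fetcherThreadEnd(String crawlerId) { }
    }

    // Overrides only the startup and shutdown hooks. A real fetcher
    // might open and close a browser or HTTP client here instead.
    static class RecordingFetcher extends BaseFetcher {
        final List<String> calls = new ArrayList<>();

        @Override
        protected void fetcherStartup(String collectorId) {
            calls.add("startup:" + collectorId);
        }

        @Override
        protected void fetcherShutdown(String collectorId) {
            calls.add("shutdown:" + collectorId);
        }
    }

    // Hypothetical driver: one collector run with one crawler thread.
    static List<String> simulateRun(RecordingFetcher fetcher) {
        fetcher.fetcherStartup("collector");
        fetcher.fetcherThreadBegin("crawler"); // default no-op
        fetcher.fetcherThreadEnd("crawler");   // default no-op
        fetcher.fetcherShutdown("collector");
        return fetcher.calls;
    }

    public static void main(String[] args) {
        // prints [startup:collector, shutdown:collector]
        System.out.println(simulateRun(new RecordingFetcher()));
    }
}
```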
      • loadHttpFetcherFromXML

        protected abstract void loadHttpFetcherFromXML(XML xml)
      • saveHttpFetcherToXML

        protected abstract void saveHttpFetcherToXML(XML xml)
      • hashCode

        public int hashCode()
        Overrides:
        hashCode in class Object