public class GenericDocumentFetcher extends Object implements IHttpDocumentFetcher, IXMLConfigurable
Default implementation of IHttpDocumentFetcher.
By default, the HTTP Collector identifies the content type and
character encoding of a document from the "Content-Type" HTTP response
header. Web servers sometimes return invalid content type and character
encoding information, or omit it entirely. Since 2.7.0, you can
optionally decide not to trust web servers' HTTP responses and have
the collector perform its own content type and encoding detection.
Such detection can be enabled with setDetectContentType(boolean)
and setDetectCharset(boolean).
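To illustrate why a "Content-Type" header may not be trustworthy, here is a minimal sketch of reading the media type and charset from a header value and reporting when either is missing or unusable. The class and method names are hypothetical and are not part of the HTTP Collector API; the collector's own detection logic is not shown in this page and may work differently.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of the HTTP Collector. */
final class ContentTypeHeaderSketch {

    /** Returns the media type (e.g. "text/html"), or empty if missing or malformed. */
    static Optional<String> mediaType(String header) {
        if (header == null) {
            return Optional.empty();
        }
        // Everything before the first ';' is the media type.
        String type = header.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        // A usable media type has the form "type/subtype".
        return type.matches("[^/\\s]+/[^/\\s]+") ? Optional.of(type) : Optional.empty();
    }

    /** Returns the declared charset, or empty if absent or not supported by this JVM. */
    static Optional<Charset> charset(String header) {
        if (header == null) {
            return Optional.empty();
        }
        String[] parts = header.split(";");
        // Parameters follow the media type, separated by ';'.
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=", 2);
            if (kv.length == 2 && kv[0].trim().equalsIgnoreCase("charset")) {
                String name = kv[1].trim().replace("\"", "");
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: treat as untrustworthy.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

When either method returns an empty result, a caller relying only on the header has nothing usable, which is the situation the detection flags are meant to address.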
<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
    detectContentType="[false|true]"
    detectCharset="[false|true]">
  <validStatusCodes>(defaults to 200)</validStatusCodes>
  <notFoundStatusCodes>(defaults to 404)</notFoundStatusCodes>
  <headersPrefix>(string to prefix headers)</headersPrefix>
</documentFetcher>
The "validStatusCodes" and "notFoundStatusCodes" elements expect a comma-separated list of HTTP response codes. If a code is listed in both elements, the valid list takes precedence.
The "notFoundStatusCodes" element was added in 2.2.0.
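The precedence rule can be sketched as follows. This is an illustration under stated assumptions, not the fetcher's actual code: the class, the Outcome enum, and the method names are hypothetical, and the handling of codes found in neither list is an assumption.

```java
import java.util.Arrays;

/** Hypothetical illustration of the status-code rules; not the library's implementation. */
final class StatusCodeSketch {

    /** Hypothetical outcome names, chosen for this sketch only. */
    enum Outcome { VALID, NOT_FOUND, BAD_STATUS }

    /** Parses a comma-separated list such as "200, 404" into an int array. */
    static int[] parseCodes(String csv) {
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    /** Classifies a response status; the valid list is checked first so it takes precedence. */
    static Outcome classify(int status, int[] valid, int[] notFound) {
        if (contains(valid, status)) {
            return Outcome.VALID;
        }
        if (contains(notFound, status)) {
            return Outcome.NOT_FOUND;
        }
        // Assumption: any other code is treated as a bad status.
        return Outcome.BAD_STATUS;
    }

    private static boolean contains(int[] codes, int code) {
        return Arrays.stream(codes).anyMatch(c -> c == code);
    }
}
```

With "200, 404" as the valid list and "404, 410" as the not-found list, this sketch classifies 404 as valid because the valid list is consulted first.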
The following configures the document fetcher not to trust HTTP response headers for the content type and encoding, and to detect them instead.
<documentFetcher detectContentType="true" detectCharset="true"/>
Constructor and Description |
---|
GenericDocumentFetcher() |
GenericDocumentFetcher(int[] validStatusCodes) |
Modifier and Type | Method and Description |
---|---|
protected org.apache.http.client.methods.HttpRequestBase | createUriRequest(HttpDocument doc) Creates the HTTP request to be executed. |
boolean | equals(Object other) |
HttpFetchResponse | fetchDocument(org.apache.http.client.HttpClient httpClient, HttpDocument doc) Fetches HTTP document and saves it to a local file. |
String | getHeadersPrefix() |
int[] | getNotFoundStatusCodes() Gets HTTP status codes to be considered as "Not found" state. |
int[] | getValidStatusCodes() |
int | hashCode() |
boolean | isDetectCharset() Gets whether character encoding is detected instead of relying on HTTP response header. |
boolean | isDetectContentType() Gets whether content type is detected instead of relying on HTTP response header. |
void | loadFromXML(Reader in) |
void | saveToXML(Writer out) |
void | setDetectCharset(boolean detectCharset) Sets whether character encoding is detected instead of relying on HTTP response header. |
void | setDetectContentType(boolean detectContentType) Sets whether content type is detected instead of relying on HTTP response header. |
void | setHeadersPrefix(String headersPrefix) |
void | setNotFoundStatusCodes(int... notFoundStatusCodes) Sets HTTP status codes to be considered as "Not found" state. |
void | setValidStatusCodes(int... validStatusCodes) |
String | toString() |
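
This page describes headersPrefix only as a "string to prefix headers". One plausible reading, assumed here, is that the prefix is prepended to each HTTP response header name before the header is stored with the document, so header names do not collide with other metadata fields. The sketch below shows that idea with plain maps; the class and method names are hypothetical and do not come from the library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of header-name prefixing; not the library's implementation. */
final class HeaderPrefixSketch {

    /** Returns a copy of the headers with the prefix prepended to every name. */
    static Map<String, String> prefixHeaders(Map<String, String> headers, String prefix) {
        // Assumption: a null prefix means header names are stored unchanged.
        String p = (prefix == null) ? "" : prefix;
        Map<String, String> metadata = new LinkedHashMap<>();
        headers.forEach((name, value) -> metadata.put(p + name, value));
        return metadata;
    }
}
```

Under this assumption, a prefix such as "http-" would store the "Content-Type" header under the name "http-Content-Type".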
public GenericDocumentFetcher()
public GenericDocumentFetcher(int[] validStatusCodes)
public int[] getValidStatusCodes()
public final void setValidStatusCodes(int... validStatusCodes)
public int[] getNotFoundStatusCodes()
public final void setNotFoundStatusCodes(int... notFoundStatusCodes)
Parameters:
notFoundStatusCodes - "Not found" codes
public String getHeadersPrefix()
public void setHeadersPrefix(String headersPrefix)
public boolean isDetectContentType()
Returns:
true if detection is enabled
public void setDetectContentType(boolean detectContentType)
Parameters:
detectContentType - true to enable detection
public boolean isDetectCharset()
Returns:
true if detection is enabled
public void setDetectCharset(boolean detectCharset)
Parameters:
detectCharset - true to enable detection
public HttpFetchResponse fetchDocument(org.apache.http.client.HttpClient httpClient, HttpDocument doc)
Fetches HTTP document and saves it to a local file.
Specified by:
fetchDocument in interface IHttpDocumentFetcher
Parameters:
httpClient - the HTTP client
doc - the document to fetch and save
protected org.apache.http.client.methods.HttpRequestBase createUriRequest(HttpDocument doc)
Creates the HTTP request to be executed. The default implementation returns an HttpGet
request around the document reference.
This method can be overridden to return another type of request,
add HTTP headers, etc.
Parameters:
doc - document to fetch
public void loadFromXML(Reader in)
Specified by:
loadFromXML in interface IXMLConfigurable
public void saveToXML(Writer out) throws IOException
Specified by:
saveToXML in interface IXMLConfigurable
Throws:
IOException
Copyright © 2009–2021 Norconex Inc. All rights reserved.