When used with a Norconex Crawler, use the following XML
as the <committer> section of your crawler configuration
to commit documents to Elasticsearch:
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <nodes>...</nodes>
  <indexName>...</indexName>
  <typeName>...</typeName>
  <ignoreResponseErrors>[false|true]</ignoreResponseErrors>
  <discoverNodes>[false|true]</discoverNodes>
  <dotReplacement>...</dotReplacement>
  <jsonFieldsPattern>...</jsonFieldsPattern>
  <connectionTimeout>(milliseconds)</connectionTimeout>
  <socketTimeout>(milliseconds)</socketTimeout>
  <maxRetryTimeout>(milliseconds)</maxRetryTimeout>
  <fixBadIds>[false|true]</fixBadIds>
  <username>...</username>
  <password>...</password>
  <passwordKey>...</passwordKey>
  <passwordKeySource>[key|file|environment|property]</passwordKeySource>
  <sourceReferenceField keep="[false|true]">...</sourceReferenceField>
  <sourceContentField keep="[false|true]">...</sourceContentField>
  <targetContentField>...</targetContentField>
  <queueDir>...</queueDir>
  <queueSize>...</queueSize>
  <commitBatchSize>...</commitBatchSize>
  <maxRetries>...</maxRetries>
  <maxRetryWait>...</maxRetryWait>
</committer>
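
As an illustration, a minimal filled-in configuration might look like the sketch below. The host names, index name, and type name are hypothetical placeholders, not values from the Norconex documentation. Every omitted tag is assumed to fall back to its documented default.

```xml
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <!-- Hypothetical cluster nodes, comma-delimited; replace with your own hosts. -->
  <nodes>http://es1.example.com:9200,http://es2.example.com:9200</nodes>
  <!-- Hypothetical index and type names. -->
  <indexName>my-index</indexName>
  <typeName>my-type</typeName>
</committer>
```
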
Tag descriptions:
Tag | Description |
---|---|
nodes | Comma-delimited list of host URLs used to connect to the cluster. Default is http://localhost:9200. |
indexName | Index name to use when committing documents to Elasticsearch. |
typeName | Type name to use when committing documents to Elasticsearch. |
ignoreResponseErrors | Optionally ignore errors in the Elasticsearch response. When ignored, errors are logged instead of throwing an exception. Default is false. |
discoverNodes | Optionally enable automatic discovery of cluster nodes beyond the configured ones. Default is false. |
dotReplacement | Optionally replace dots in field names with the specified value. Default is null (dots are not replaced). |
jsonFieldsPattern | Optional regular expression to identify fields containing JSON objects instead of regular strings. |
connectionTimeout | Elasticsearch connection timeout (default 1 second). |
socketTimeout | Elasticsearch socket timeout (default 30 seconds). |
maxRetryTimeout | Maximum amount of time to wait before retrying a failing Elasticsearch host (default 30 seconds). |
fixBadIds | Flag to fix ids not matching Elasticsearch ID limitations. |
username | Basic authentication user name. |
password | Basic authentication password. |
passwordKey | Reference to password key (or actual key) for encrypted passwords. See the API Documentation for encryption instructions. |
passwordKeySource | Source of password key for encrypted passwords. See the API Documentation for encryption instructions. |
sourceReferenceField | Name of the source field that will be mapped to the Elasticsearch id field. Default is the document reference, which the Committer stores as document.reference. The source metadata field is deleted unless keep is set to true. |
sourceContentField | Source field name for the document content/body. By default no field is used and the content is taken from the document body. Once re-mapped, the source metadata field is deleted unless keep is set to true. |
targetContentField | Target field name for the document content/body. Default is content. |
queueDir | Optional path where files are queued before being sent to Elasticsearch. Default is ./committer-queue. |
queueSize | Optional maximum queue size before documents are sent to Elasticsearch. Default is 1000. |
commitBatchSize | Optional maximum number of documents to send to Elasticsearch at once. Default is 100. |
maxRetries | Maximum retries upon commit failures. Default is 0 (no retry). |
maxRetryWait | Maximum delay (in milliseconds) between retries. Default is 0 (no delay). |
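
To show how several of the tags above combine, here is a hedged sketch that adds basic authentication, explicit timeouts, and a retry policy. The index and type names, the user name, the password placeholder, and the environment variable name ES_PASSWORD_KEY are hypothetical. The sketch assumes that when passwordKeySource is set to environment, passwordKey holds the name of an environment variable containing the key; check the API Documentation before relying on this.

```xml
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <nodes>http://localhost:9200</nodes>
  <!-- Hypothetical index and type names. -->
  <indexName>my-index</indexName>
  <typeName>my-type</typeName>

  <!-- Timeouts in milliseconds: 1 second to connect, 30 seconds on the socket. -->
  <connectionTimeout>1000</connectionTimeout>
  <socketTimeout>30000</socketTimeout>

  <!-- Hypothetical credentials. The password value stands in for an
       encrypted password produced per the API Documentation. -->
  <username>crawler</username>
  <password>ENCRYPTED_PASSWORD_PLACEHOLDER</password>
  <!-- Assumption: with source "environment", this is the name of an
       environment variable that holds the encryption key. -->
  <passwordKey>ES_PASSWORD_KEY</passwordKey>
  <passwordKeySource>environment</passwordKeySource>

  <!-- Queue up to 1000 documents on disk, then send them in batches of 100. -->
  <queueDir>./committer-queue</queueDir>
  <queueSize>1000</queueSize>
  <commitBatchSize>100</commitBatchSize>

  <!-- Retry a failed commit up to 3 times, waiting at most 5 seconds between attempts. -->
  <maxRetries>3</maxRetries>
  <maxRetryWait>5000</maxRetryWait>
</committer>
```

The retry values here are arbitrary examples. The documented defaults (0 retries, 0 delay) cause a commit failure to surface immediately.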