When used with a Norconex Crawler, add the following XML as the
`<committer>` section of your crawler configuration
to commit documents to Elasticsearch:

    <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <nodes>...</nodes>
        <indexName>...</indexName>
        <typeName>...</typeName>
        <ignoreResponseErrors>[false|true]</ignoreResponseErrors>
        <discoverNodes>[false|true]</discoverNodes>
        <dotReplacement>...</dotReplacement>
        <jsonFieldsPattern>...</jsonFieldsPattern>
        <connectionTimeout>(milliseconds)</connectionTimeout>
        <socketTimeout>(milliseconds)</socketTimeout>
        <maxRetryTimeout>(milliseconds)</maxRetryTimeout>
        <fixBadIds>[false|true]</fixBadIds>
        <username>...</username>
        <password>...</password>
        <passwordKey>...</passwordKey>
        <passwordKeySource>[key|file|environment|property]</passwordKeySource>
        <sourceReferenceField keep="[false|true]">...</sourceReferenceField>
        <sourceContentField keep="[false|true]">...</sourceContentField>
        <targetContentField>...</targetContentField>
        <queueDir>...</queueDir>
        <queueSize>...</queueSize>
        <commitBatchSize>...</commitBatchSize>
        <maxRetries>...</maxRetries>
        <maxRetryWait>...</maxRetryWait>
    </committer>

Tag descriptions:
| Tag | Description |
|---|---|
| nodes | Comma-delimited list of host URLs used to connect to the cluster. Default is http://localhost:9200. |
| indexName | Index name to use when committing documents to Elasticsearch. |
| typeName | Type name to use when committing documents to Elasticsearch. |
| ignoreResponseErrors | Optionally ignore errors in Elasticsearch responses. When ignored, errors are logged instead of throwing an exception. Default is false. |
| discoverNodes | Optionally enable automatic discovery of cluster nodes beyond the configured ones. Default is false. |
| dotReplacement | Optionally replace dots in field names with a value of your choice. Default is null (dots are not replaced). |
| jsonFieldsPattern | Optional regular expression to identify fields containing JSON objects instead of regular strings. |
| connectionTimeout | Elasticsearch connection timeout (default 1 second). |
| socketTimeout | Elasticsearch socket timeout (default 30 seconds). |
| maxRetryTimeout | Maximum amount of time to wait before retrying a failing Elasticsearch host (default 30 seconds). |
| fixBadIds | Flag to fix IDs that do not meet Elasticsearch ID limitations. |
| username | Basic authentication user name. |
| password | Basic authentication password. |
| passwordKey | Reference to password key (or actual key) for encrypted passwords. See the API Documentation for encryption instructions. |
| passwordKeySource | Source of password key for encrypted passwords. See the API Documentation for encryption instructions. |
| sourceReferenceField | Name of the source field that will be mapped to the Elasticsearch id field. Default is the document reference, which the Committer stores as document.reference. The metadata source field is deleted unless keep is set to true. |
| sourceContentField | Source field name for a document content/body. Default is not a field, but rather the document body content. Once re-mapped, the metadata source field is deleted, unless keep is set to true. |
| targetContentField | Target field name for a document content/body. Default is: content. |
| queueDir | Optional path where files are queued before being sent to Elasticsearch. Default is: ./committer-queue. |
| queueSize | Optional maximum queue size before sending documents to Elasticsearch. Default is: 1000. |
| commitBatchSize | Optional maximum number of documents to send to Elasticsearch at once. Default is: 100. |
| maxRetries | Maximum retries upon commit failures. Default is 0 (no retry). |
| maxRetryWait | Maximum delay (in milliseconds) between retries. Default is 0 (no delay). |
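
As an illustration, a filled-in configuration might look like the sketch below. The index name, type name, and retry values are hypothetical placeholders, not values taken from this documentation. The node URL, timeouts, queue size, and batch size simply restate the defaults listed in the table.

```xml
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <!-- Documented default node URL; list several, comma-delimited, for a cluster. -->
  <nodes>http://localhost:9200</nodes>
  <!-- Hypothetical index and type names; replace with your own. -->
  <indexName>my-index</indexName>
  <typeName>my-type</typeName>
  <!-- Timeouts in milliseconds, matching the documented defaults (1 s and 30 s). -->
  <connectionTimeout>1000</connectionTimeout>
  <socketTimeout>30000</socketTimeout>
  <!-- Documented defaults for queueing and batching. -->
  <queueSize>1000</queueSize>
  <commitBatchSize>100</commitBatchSize>
  <!-- Hypothetical retry policy: up to 3 retries, at most 5 seconds apart. -->
  <maxRetries>3</maxRetries>
  <maxRetryWait>5000</maxRetryWait>
</committer>
```

Tags omitted here, such as ignoreResponseErrors and discoverNodes, keep the defaults listed in the table (false in both cases).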