When used with a Norconex Crawler, use the following XML as the
`<committer>` section of your crawler configuration to send
documents to Amazon CloudSearch:

    <committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">

        <!-- Mandatory: -->
        <documentEndpoint>...</documentEndpoint>

        <!-- Mandatory if not configured elsewhere: -->
        <accessKey>...</accessKey>
        <secretKey>...</secretKey>

        <!-- Optional settings: -->
        <fixBadIds>[false|true]</fixBadIds>

        <!-- Proxy (since 1.4.0) -->
        <proxyHost>...</proxyHost>
        <proxyPort>...</proxyPort>
        <proxyUsername>...</proxyUsername>
        <proxyPassword>...</proxyPassword>
        <!-- Use the following if password is encrypted. -->
        <proxyPasswordKey>...</proxyPasswordKey>
        <proxyPasswordKeySource>...</proxyPasswordKeySource>

        <sourceReferenceField keep="[false|true]">...</sourceReferenceField>
        <sourceContentField keep="[false|true]">...</sourceContentField>
        <targetContentField>...</targetContentField>
        <commitBatchSize>...</commitBatchSize>
        <queueDir>...</queueDir>
        <queueSize>...</queueSize>
        <maxRetries>...</maxRetries>
        <maxRetryWait>...</maxRetryWait>
    </committer>

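
As an illustration, the smallest configuration the template allows might look like the sketch below. It assumes the access and secret keys are configured elsewhere (for example in the environment), so only the mandatory `documentEndpoint` is set. The endpoint host shown is a made-up placeholder, not a real domain.

```xml
<committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
    <!-- Placeholder value: replace with the document endpoint of your own domain. -->
    <documentEndpoint>doc-example-domain.us-east-1.cloudsearch.amazonaws.com</documentEndpoint>
</committer>
```

All other tags are left out here, so the committer would fall back to its defaults.
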
Tag descriptions:
| Tag | Description |
|---|---|
| documentEndpoint | CloudSearch document endpoint (where to send documents for indexing). |
| accessKey | Optional CloudSearch access key. Will be taken from environment when blank. |
| secretKey | Optional CloudSearch secret key. Will be taken from environment when blank. |
| fixBadIds | Whether to fix document IDs that do not comply with CloudSearch ID restrictions (true or false). |
| proxyHost | Optional proxy host. |
| proxyPort | Optional proxy port. |
| proxyUsername | Optional proxy username. |
| proxyPassword | Optional proxy password. |
| proxyPasswordKey | Optional proxy password key if password is encrypted. Refer to the API Documentation for more details. |
| proxyPasswordKeySource | Optional password encryption key source. One of key, file, environment, or property. Refer to the API Documentation for more details. |
| sourceReferenceField | Name of source field that will be mapped to the CloudSearch target id field. Default is the document reference the Committer stores as committer.reference. Once re-mapped, the metadata source field is deleted, unless keep is set to true. |
| targetReferenceField | Name of target id field. Default is id. |
| sourceContentField | Name of the source field holding the document content/body. By default no field is used; the document body content is used instead. Once re-mapped, the metadata source field is deleted, unless keep is set to true. |
| targetContentField | CloudSearch target field name for a document content/body. Default is: content. |
| queueDir | Directory in which files are queued before being sent to CloudSearch. Default is: ./committer-queue |
| queueSize | Number of documents or deletes to queue before sending to CloudSearch. Default is: 1000. |
| commitBatchSize | Maximum number of documents to send to CloudSearch at once. Default is: 100. |
| maxRetries | Maximum number of retries upon commit failures. Default is: 0 (no retry). |
| maxRetryWait | Delay between retries. Default is: 0 (no delay). |
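
To show how several of the optional tags described above might combine, here is a hedged sketch rather than a recommended setup. Every value is illustrative: the endpoint, proxy host, user name, `PROXY_KEY` variable name, and `document.url` field name are hypothetical. It also assumes that with a key source of `environment` the `proxyPasswordKey` value names an environment variable, and that `maxRetryWait` is expressed in milliseconds. Neither point is stated on this page, so check the API Documentation before relying on them.

```xml
<committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
    <!-- Placeholder endpoint, not a real domain. -->
    <documentEndpoint>doc-example-domain.us-east-1.cloudsearch.amazonaws.com</documentEndpoint>

    <!-- Hypothetical proxy. The password is left elided on purpose. -->
    <proxyHost>proxy.example.com</proxyHost>
    <proxyPort>8080</proxyPort>
    <proxyUsername>crawler</proxyUsername>
    <proxyPassword>...</proxyPassword>
    <!-- Assumption: the key is read from an environment variable named PROXY_KEY. -->
    <proxyPasswordKey>PROXY_KEY</proxyPasswordKey>
    <proxyPasswordKeySource>environment</proxyPasswordKeySource>

    <!-- Use a hypothetical "document.url" field as the CloudSearch id,
         and keep the original field instead of deleting it. -->
    <sourceReferenceField keep="true">document.url</sourceReferenceField>

    <!-- Send smaller batches than the default of 100. -->
    <commitBatchSize>50</commitBatchSize>
    <!-- Queue the default number of documents or deletes before sending. -->
    <queueSize>1000</queueSize>

    <!-- Retry failed commits up to 3 times. The wait unit is assumed to be milliseconds. -->
    <maxRetries>3</maxRetries>
    <maxRetryWait>5000</maxRetryWait>
</committer>
```

Tags omitted from this sketch, such as `fixBadIds`, `targetContentField`, and `queueDir`, would keep the defaults listed in the table.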