To use Amazon CloudSearch with a Norconex Crawler, add the following XML as the `<committer>` section of your crawler configuration:

    <committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">

        <!-- Mandatory: -->
        <documentEndpoint>...</documentEndpoint>

        <!-- Mandatory if not configured elsewhere: -->
        <accessKey>...</accessKey>
        <secretKey>...</secretKey>

        <!-- Optional settings: -->
        <fixBadIds>[false|true]</fixBadIds>

        <!-- Proxy (since 1.4.0) -->
        <proxyHost>...</proxyHost>
        <proxyPort>...</proxyPort>
        <proxyUsername>...</proxyUsername>
        <proxyPassword>...</proxyPassword>
        <!-- Use the following if password is encrypted. -->
        <proxyPasswordKey>...</proxyPasswordKey>
        <proxyPasswordKeySource>...</proxyPasswordKeySource>

        <sourceReferenceField keep="[false|true]">...</sourceReferenceField>
        <sourceContentField keep="[false|true]">...</sourceContentField>
        <targetContentField>...</targetContentField>
        <commitBatchSize>...</commitBatchSize>
        <queueDir>...</queueDir>
        <queueSize>...</queueSize>
        <maxRetries>...</maxRetries>
        <maxRetryWait>...</maxRetryWait>
    </committer>

Tag descriptions:
Tag | Description |
---|---|
documentEndpoint | CloudSearch document endpoint (where to send documents for indexing). |
accessKey | Optional CloudSearch access key. Will be taken from environment when blank. |
secretKey | Optional CloudSearch secret key. Will be taken from environment when blank. |
fixBadIds | Flag to fix ids not matching CloudSearch ID limitations. |
proxyHost | Optional proxy host. |
proxyPort | Optional proxy port. |
proxyUsername | Optional proxy username. |
proxyPassword | Optional proxy password. |
proxyPasswordKey | Optional proxy password key if password is encrypted. Refer to the API Documentation for more details. |
proxyPasswordKeySource | Optional password encryption key source. One of `key`, `file`, `environment`, or `property`. Refer to the API Documentation for more details. |
sourceReferenceField | Name of the source field that will be mapped to the CloudSearch target id field. Default is the document reference the Committer stores as `committer.reference`. Once re-mapped, the metadata source field is deleted, unless `keep` is set to true. |
targetReferenceField | Name of the target id field. Default is `id`. |
sourceContentField | Name of the source field holding the document content/body. By default no field is used; the document body content is used instead. Once re-mapped, the metadata source field is deleted, unless `keep` is set to true. |
targetContentField | CloudSearch target field name for a document content/body. Default is: content. |
queueDir | Path where to queue files before sending them to CloudSearch. Default is: ./committer-queue |
queueSize | Number of documents or deletes to queue before sending to CloudSearch. Default is: 1000. |
commitBatchSize | Maximum number of documents to send to CloudSearch at once. Default is: 100. |
maxRetries | Maximum number of retries upon commit failures. Default is: 0 (no retry). |
maxRetryWait | Delay between retries. Default is: 0 (no delay). |
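
As an illustration of how the tags described above fit together, here is a minimal sketch of a filled-in configuration. Every value in it is an assumption chosen for the example: the document endpoint is a hypothetical placeholder, the retry settings are arbitrary, and the unit used for `maxRetryWait` is not stated in the table above, so check the API Documentation before relying on it.

```xml
<!-- Illustrative values only; replace them with settings for your own environment. -->
<committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">

  <!-- Hypothetical endpoint: use the document endpoint of your own search domain. -->
  <documentEndpoint>doc-example-domain.us-east-1.cloudsearch.amazonaws.com</documentEndpoint>

  <!-- accessKey and secretKey are left out here so that they are
       taken from the environment, as described in the table. -->

  <!-- Fix ids that do not match CloudSearch ID limitations. -->
  <fixBadIds>true</fixBadIds>

  <!-- Queue and batch settings, shown with their documented defaults. -->
  <queueDir>./committer-queue</queueDir>
  <queueSize>1000</queueSize>
  <commitBatchSize>100</commitBatchSize>

  <!-- Retry up to 3 times on commit failure (arbitrary example value). -->
  <maxRetries>3</maxRetries>
  <!-- Delay between retries (arbitrary example value; unit assumed, verify it). -->
  <maxRetryWait>5000</maxRetryWait>

</committer>
```

Tags omitted from this sketch, such as the proxy settings and the field mappings, keep the defaults listed in the table.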