public class CloudSearchCommitter extends AbstractMappedCommitter
Commits documents to Amazon CloudSearch.
An access key and a secret key are required to interact with
CloudSearch. For enhanced security, it is best to set them using one of
the methods described in DefaultAWSCredentialsProviderChain
(environment variables, system properties, profile file, etc.).
Do not explicitly set "accessKey" and "secretKey" on this class if you
want to rely on these safer methods.
As of this writing, CloudSearch has a 128-character limit
on its "id" field. In addition, certain characters are not allowed.
By default, trying to submit a document with an invalid ID results in
an error. As of 1.3.0, you can get around this by setting
setFixBadIds(boolean) to true. The committer will then
truncate references that are too long and append a hash code
representing the truncated part. It will also convert invalid
characters to underscores. This approach does not fully guarantee
uniqueness (collisions remain possible), but it should safely cover
the vast majority of cases.
If you want to keep the original (non-truncated) URL, make sure you set
AbstractMappedCommitter.setKeepSourceReferenceField(boolean) to true.
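The ID fixing described above can be pictured with the following minimal sketch. It is not the committer's actual implementation: the class name, the method name, the set of allowed characters, and the hash format are all assumptions made for illustration.

```java
/** Hypothetical helper illustrating the "fix bad IDs" idea; not library code. */
final class IdFixSketch {

    /** The 128-character limit mentioned in this documentation. */
    static final int MAX_ID_LENGTH = 128;

    /** Separator plus 8 hex digits of the hash. */
    private static final int SUFFIX_LENGTH = 9;

    /**
     * Converts disallowed characters to underscores, then truncates IDs that
     * are too long and appends a hash code of the truncated part.
     */
    static String fixId(String reference) {
        // Assumption: only these characters are treated as valid in an ID.
        String cleaned = reference.replaceAll("[^a-zA-Z0-9_\\-=#;:/?@&]", "_");
        if (cleaned.length() <= MAX_ID_LENGTH) {
            return cleaned;
        }
        int keep = MAX_ID_LENGTH - SUFFIX_LENGTH;
        String truncatedPart = cleaned.substring(keep);
        // %08x always yields exactly 8 hex digits for an int hash code.
        String hash = String.format("%08x", truncatedPart.hashCode());
        return cleaned.substring(0, keep) + "_" + hash;
    }

    public static void main(String[] args) {
        System.out.println(fixId("http://example.com/some page"));
    }
}
```

Because the hash only summarizes the truncated part, two different long references can still map to the same ID, which is why the approach is not collision-free.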
As of 1.4.0, it is possible to specify proxy settings and, optionally, to
have the supplied password encrypted using EncryptionUtil or the
encrypt/decrypt scripts packaged with this library.
For the crawler to decrypt the password properly, you need
to specify the encryption key used to encrypt it. The key can be stored
in a few supported locations, and a combination of proxyPasswordKey
and proxyPasswordKeySource must be specified to locate the key.
The supported sources are:
proxyPasswordKeySource | proxyPasswordKey |
---|---|
key | The actual encryption key. |
file | Path to a file containing the encryption key. |
environment | Name of an environment variable containing the key. |
property | Name of a JVM system property containing the key. |
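To make the table concrete, here is a minimal sketch of how a key could be located from the two settings. The class and method names are hypothetical, and the lookup logic is an assumption based only on the table above; it does not reproduce how EncryptionUtil or EncryptionKey resolve keys.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical illustration of the four key sources; not library code. */
final class KeySourceSketch {

    /**
     * Resolves the encryption key.
     * @param value  the proxyPasswordKey setting
     * @param source the proxyPasswordKeySource setting
     * @return the key, or null if the variable or property is not set
     */
    static String resolveKey(String value, String source) {
        switch (source) {
            case "key":
                return value;                      // value is the key itself
            case "file":
                try {                              // value is a file path
                    byte[] bytes = Files.readAllBytes(Paths.get(value));
                    return new String(bytes, StandardCharsets.UTF_8).trim();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            case "environment":
                return System.getenv(value);       // value is a variable name
            case "property":
                return System.getProperty(value);  // value is a property name
            default:
                throw new IllegalArgumentException(
                        "Unsupported key source: " + source);
        }
    }

    public static void main(String[] args) {
        System.setProperty("sketch.proxy.key", "example-key");
        System.out.println(resolveKey("sketch.proxy.key", "property"));
    }
}
```

Storing the key in a file, environment variable, or system property keeps it out of the configuration file that holds the encrypted password.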
<committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">

  <!-- Mandatory: -->
  <serviceEndpoint>(CloudSearch service endpoint)</serviceEndpoint>

  <!-- Mandatory if not configured elsewhere: -->
  <accessKey>
    (Optional CloudSearch access key. Will be taken from environment when blank.)
  </accessKey>
  <secretKey>
    (Optional CloudSearch secret key. Will be taken from environment when blank.)
  </secretKey>

  <!-- Optional settings: -->
  <fixBadIds>
    [false|true](Forces references to fit into a CloudSearch id field.)
  </fixBadIds>
  <signingRegion>(CloudSearch signing region)</signingRegion>
  <proxyHost>...</proxyHost>
  <proxyPort>...</proxyPort>
  <proxyUsername>...</proxyUsername>
  <proxyPassword>...</proxyPassword>
  <!-- Use the following if password is encrypted. -->
  <proxyPasswordKey>(the encryption key or a reference to it)</proxyPasswordKey>
  <proxyPasswordKeySource>[key|file|environment|property]</proxyPasswordKeySource>

  <sourceReferenceField keep="[false|true]">
    (Optional name of field that contains the document reference, when
    the default document reference is not used. The reference value
    will be mapped to CloudSearch "id" field, which is mandatory.
    Once re-mapped, this metadata source field is
    deleted, unless "keep" is set to true.)
  </sourceReferenceField>
  <sourceContentField keep="[false|true]">
    (If you wish to use a metadata field to act as the document
    "content", you can specify that field here. Default
    does not take a metadata field but rather the document content.
    Once re-mapped, the metadata source field is deleted,
    unless "keep" is set to true.)
  </sourceContentField>
  <targetContentField>
    (CloudSearch target field name for a document content/body.
    Default is: content)
  </targetContentField>
  <commitBatchSize>
    (Max number of docs to send CloudSearch at once. If you experience
    memory problems, lower this number. Default is 100.)
  </commitBatchSize>
  <queueDir>(Optional path where to queue files)</queueDir>
  <queueSize>
    (Max queue size before committing. Default is 1000.)
  </queueSize>
  <maxRetries>
    (Max retries upon commit failures. Default is 0.)
  </maxRetries>
  <maxRetryWait>
    (Max delay in milliseconds between retries. Default is 0.)
  </maxRetryWait>
</committer>
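The relationship between queueSize and commitBatchSize can be pictured as follows: once the queue is full, its documents are sent in groups of at most commitBatchSize. The sketch below only illustrates that partitioning; the class and method names are hypothetical and the code is not taken from the committer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of splitting queued documents into batches. */
final class BatchSketch {

    /** Splits the queued items into consecutive batches of at most commitBatchSize. */
    static <T> List<List<T>> toBatches(List<T> queued, int commitBatchSize) {
        if (commitBatchSize < 1) {
            throw new IllegalArgumentException("commitBatchSize must be >= 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < queued.size(); start += commitBatchSize) {
            int end = Math.min(start + commitBatchSize, queued.size());
            // Copy the view so each batch is independent of the queue.
            batches.add(new ArrayList<>(queued.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> queued = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            queued.add(i);
        }
        // With the documented defaults (queue of 1000, batches of 100).
        System.out.println(toBatches(queued, 100).size() + " batches");
    }
}
```

Lowering commitBatchSize reduces how many documents are held in a single request, which is why it is the setting to adjust when memory is a concern.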
XML configuration entries expecting millisecond durations
can be provided in human-readable format (English only), as per
DurationParser (e.g., "5 minutes and 30 seconds" or "5m30s").
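As a rough illustration of what such a conversion involves, the sketch below turns a compact duration such as "5m30s" into milliseconds. It is a hypothetical stand-in, not DurationParser: it handles only plain numbers and the compact units d, h, m, s and ms, and the class and method names are assumptions.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical compact-duration converter; not the library's DurationParser. */
final class DurationSketch {

    // A number followed by a unit that is not part of a longer word.
    private static final Pattern PART =
            Pattern.compile("(\\d+)\\s*(ms|d|h|m|s)(?![a-z])");

    /** Converts text such as "5m30s" or "1500" to milliseconds. */
    static long toMillis(String text) {
        String s = text.trim().toLowerCase(Locale.ROOT);
        if (s.matches("\\d+")) {
            return Long.parseLong(s);   // plain number: already milliseconds
        }
        Matcher m = PART.matcher(s);
        long total = 0;
        boolean found = false;
        while (m.find()) {
            found = true;
            long amount = Long.parseLong(m.group(1));
            switch (m.group(2)) {
                case "ms": total += amount; break;
                case "s":  total += amount * 1_000L; break;
                case "m":  total += amount * 60_000L; break;
                case "h":  total += amount * 3_600_000L; break;
                default:   total += amount * 86_400_000L; break; // "d"
            }
        }
        if (!found) {
            throw new IllegalArgumentException("Unsupported duration: " + text);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(toMillis("5m30s"));
    }
}
```

In a configuration file, this means a value like maxRetryWait can be written as a duration string instead of a raw millisecond count.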
Modifier and Type | Field and Description |
---|---|
static String | COULDSEARCH_ID_FIELD CloudSearch mandatory ID field |
static String | DEFAULT_COULDSEARCH_CONTENT_FIELD Default CloudSearch content field |
static Pattern | FIELD_PATTERN CloudSearch mandatory field pattern. |
DEFAULT_COMMIT_BATCH_SIZE
DEFAULT_QUEUE_DIR, filesCommitting
DEFAULT_QUEUE_SIZE, queueSize
Constructor and Description |
---|
CloudSearchCommitter() |
CloudSearchCommitter(String serviceEndpoint) |
CloudSearchCommitter(String serviceEndpoint, String signingRegion) |
Modifier and Type | Method and Description |
---|---|
protected com.amazonaws.ClientConfiguration | buildClientConfiguration() |
protected void | commitBatch(List<ICommitOperation> batch) |
boolean | equals(Object obj) |
String | getAccessKey() Gets the CloudSearch access key. |
String | getDocumentEndpoint() Deprecated. Since 1.2.0, use getServiceEndpoint() |
String | getProxyHost() Gets the proxy host. |
String | getProxyPassword() Gets the proxy password. |
EncryptionKey | getProxyPasswordKey() Gets the proxy password encryption key. |
int | getProxyPort() Gets the proxy port. |
String | getProxyUsername() Gets the proxy username. |
String | getSecretKey() Gets the CloudSearch secret key. |
String | getServiceEndpoint() Gets the AWS service endpoint. |
String | getSigningRegion() Gets the AWS signing region. |
int | hashCode() |
boolean | isFixBadIds() Gets whether to fix IDs that are too long for the CloudSearch ID limit (128 characters max). |
protected void | loadFromXml(XMLConfiguration xml) |
protected void | saveToXML(XMLStreamWriter writer) |
void | setAccessKey(String accessKey) Sets the CloudSearch access key. |
void | setDocumentEndpoint(String documentEndpoint) Deprecated. Since 1.2.0, use setServiceEndpoint(String) |
void | setFixBadIds(boolean fixBadIds) Sets whether to fix IDs that are too long for the CloudSearch ID limit (128 characters max). |
void | setProxyHost(String proxyHost) Sets the proxy host. |
void | setProxyPassword(String proxyPassword) Sets the proxy password. |
void | setProxyPasswordKey(EncryptionKey proxyPasswordKey) Sets the proxy password encryption key. |
void | setProxyPort(int proxyPort) Sets the proxy port. |
void | setProxyUsername(String proxyUsername) Sets the proxy username. |
void | setSecretKey(String secretKey) Sets the CloudSearch secret key. |
void | setServiceEndpoint(String serviceEndpoint) Sets the AWS service endpoint. |
void | setSigningRegion(String signingRegion) Sets the AWS signing region. |
void | setTargetReferenceField(String targetReferenceField) This method is not supported and will throw an UnsupportedOperationException if invoked. |
String | toString() |
getSourceContentField, getSourceReferenceField, getTargetContentField, getTargetReferenceField, isKeepSourceContentField, isKeepSourceReferenceField, loadFromXML, prepareCommitAddition, saveToXML, setKeepSourceContentField, setKeepSourceReferenceField, setSourceContentField, setSourceReferenceField, setTargetContentField
commitAddition, commitComplete, commitDeletion, getCommitBatchSize, getMaxRetries, getMaxRetryWait, setCommitBatchSize, setMaxRetries, setMaxRetryWait
commit, getInitialQueueDocCount, getQueueDir, prepareCommitDeletion, queueAddition, queueRemoval, setQueueDir
add, getQueueSize, remove, setQueueSize
public static final Pattern FIELD_PATTERN
public static final String COULDSEARCH_ID_FIELD
public static final String DEFAULT_COULDSEARCH_CONTENT_FIELD
public CloudSearchCommitter()
public CloudSearchCommitter(String serviceEndpoint)
public String getServiceEndpoint()
public void setServiceEndpoint(String serviceEndpoint)
serviceEndpoint - AWS service endpoint
public String getSigningRegion()
public void setSigningRegion(String signingRegion)
signingRegion - the AWS signing region
@Deprecated public String getDocumentEndpoint()
Deprecated. Since 1.2.0, use getServiceEndpoint()
@Deprecated public void setDocumentEndpoint(String documentEndpoint)
Deprecated. Since 1.2.0, use setServiceEndpoint(String)
documentEndpoint - document endpoint
public String getAccessKey()
Gets the CloudSearch access key. If null, the access key
will be obtained from the environment, as detailed in
DefaultAWSCredentialsProviderChain.
public void setAccessKey(String accessKey)
Sets the CloudSearch access key. If null, the access key
will be obtained from the environment, as detailed in
DefaultAWSCredentialsProviderChain.
accessKey - the access key
public String getSecretKey()
Gets the CloudSearch secret key. If null, the secret key
will be obtained from the environment, as detailed in
DefaultAWSCredentialsProviderChain.
public void setSecretKey(String secretKey)
Sets the CloudSearch secret key. If null, the secret key
will be obtained from the environment, as detailed in
DefaultAWSCredentialsProviderChain.
secretKey - the secret key
public void setTargetReferenceField(String targetReferenceField)
This method is not supported and will throw an
UnsupportedOperationException if invoked. With CloudSearch,
the target field for a document unique id is always "id".
setTargetReferenceField in class AbstractMappedCommitter
targetReferenceField - the target field
public boolean isFixBadIds()
Gets whether to fix IDs that are too long for the CloudSearch
ID limit (128 characters max). If true,
long IDs will be truncated and a hash code representing the
truncated part will be appended.
Returns: true to fix IDs that are too long
public void setFixBadIds(boolean fixBadIds)
Sets whether to fix IDs that are too long for the CloudSearch
ID limit (128 characters max). If true,
long IDs will be truncated and a hash code representing the
truncated part will be appended.
fixBadIds - true to fix IDs that are too long
public String getProxyHost()
public void setProxyHost(String proxyHost)
proxyHost - proxy host
public int getProxyPort()
public void setProxyPort(int proxyPort)
proxyPort - proxy port
public String getProxyUsername()
public void setProxyUsername(String proxyUsername)
proxyUsername - proxy username
public String getProxyPassword()
public void setProxyPassword(String proxyPassword)
proxyPassword - proxy password
public EncryptionKey getProxyPasswordKey()
Gets the proxy password encryption key, or null if the password is not
encrypted.
See also: EncryptionUtil
public void setProxyPasswordKey(EncryptionKey proxyPasswordKey)
proxyPasswordKey - password key
See also: EncryptionUtil
protected void commitBatch(List<ICommitOperation> batch)
commitBatch in class AbstractBatchCommitter
protected com.amazonaws.ClientConfiguration buildClientConfiguration()
protected void saveToXML(XMLStreamWriter writer) throws XMLStreamException
saveToXML in class AbstractMappedCommitter
Throws: XMLStreamException
protected void loadFromXml(XMLConfiguration xml)
loadFromXml in class AbstractMappedCommitter
public int hashCode()
hashCode in class AbstractMappedCommitter
public boolean equals(Object obj)
equals in class AbstractMappedCommitter
public String toString()
toString in class AbstractMappedCommitter
Copyright © 2009–2020 Norconex Inc. All rights reserved.