public class MD5DocumentChecksummer extends AbstractDocumentChecksummer
Implementation of IDocumentChecksummer which returns an MD5 checksum value
of the extracted document content unless one or more source fields are
specified, in which case the MD5 checksum value is constructed from those
fields. This checksum is normally computed right after the document has
been imported.
You have the option to keep the checksum as a document metadata field.
When AbstractDocumentChecksummer.setKeep(boolean) is true, the checksum
will be stored in the target field name specified. If you do not specify
any, it is stored under the metadata field name
CrawlDocMetadata.CHECKSUM_METADATA.
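The keep and target-field rule above can be pictured with a minimal sketch. This is not the library's code: the class name KeepChecksumSketch, the store method, the plain Map standing in for document metadata, and the DEFAULT_FIELD value are all hypothetical placeholders (the real default is whatever CrawlDocMetadata.CHECKSUM_METADATA resolves to).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the Norconex API.
class KeepChecksumSketch {

    // Placeholder standing in for CrawlDocMetadata.CHECKSUM_METADATA.
    // The real constant's value is not assumed here.
    static final String DEFAULT_FIELD = "checksum-metadata-placeholder";

    // Stores the checksum in the metadata map only when "keep" is true.
    // Falls back to the default field when no target field is given.
    static void store(Map<String, String> metadata, String checksum,
            boolean keep, String toField) {
        if (!keep) {
            return; // toField is ignored unless keep is true
        }
        String field = (toField == null || toField.isEmpty())
                ? DEFAULT_FIELD : toField;
        metadata.put(field, checksum);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        store(metadata, "abc123", true, null);
        System.out.println(metadata.containsKey(DEFAULT_FIELD));
    }
}
```

In this sketch, a target field supplied while keep is false has no effect, which mirrors the note that toField is ignored unless keep is set to true.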
Since 1.9.0, it is possible to use a combination of document content
and fields to create the checksum by setting combineFieldsAndContent
to true.
If you combine fields and content but do not define a field matcher,
it is equivalent to adding all fields.
If you do not combine the two, specifying a field matcher ignores the
content, while specifying none uses only the content.
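The four combinations described above (field matcher or not, combined or not) can be summarized in a small self-contained sketch. It is an assumption-laden illustration, not the class's actual implementation: ChecksumSourceSketch, md5Hex, checksum, the Predicate used in place of a TextMatcher, and the "name=value;" serialization of fields are all hypothetical. Only java.security.MessageDigest is a real API here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Hypothetical illustration only; not the Norconex implementation.
class ChecksumSourceSketch {

    // Returns the lowercase hexadecimal MD5 digest of the UTF-8 text.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Picks the checksum source following the documented rules:
    //   matcher, no combine -> matching fields only
    //   no matcher, no combine -> content only
    //   combine, matcher -> matching fields plus content
    //   combine, no matcher -> all fields plus content
    static String checksum(String content, Map<String, String> fields,
            Predicate<String> fieldMatcher, boolean combine) {
        boolean useFields = fieldMatcher != null || combine;
        boolean useContent = fieldMatcher == null || combine;
        StringBuilder source = new StringBuilder();
        if (useFields) {
            Predicate<String> matcher =
                    fieldMatcher != null ? fieldMatcher : name -> true;
            // Sort field names so the result does not depend on map order.
            for (Map.Entry<String, String> e
                    : new TreeMap<>(fields).entrySet()) {
                if (matcher.test(e.getKey())) {
                    source.append(e.getKey()).append('=')
                            .append(e.getValue()).append(';');
                }
            }
        }
        if (useContent) {
            source.append(content);
        }
        return md5Hex(source.toString());
    }

    public static void main(String[] args) {
        Map<String, String> fields = Map.of("title", "A", "author", "B");
        System.out.println(checksum("body text", fields, null, false));
        System.out.println(
                checksum("body text", fields, n -> n.equals("title"), true));
    }
}
```

With a matcher and no combining, changing the content leaves the sketch's checksum unchanged, which is the practical consequence of "specifying a field matcher ignores the content".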
<documentChecksummer
    class="com.norconex.collector.core.checksum.impl.MD5DocumentChecksummer"
    combineFieldsAndContent="[false|true]"
    keep="[false|true]"
    toField="(optional metadata field to store the checksum)">
  <fieldMatcher
      method="[basic|csv|wildcard|regex]"
      ignoreCase="[false|true]"
      ignoreDiacritic="[false|true]"
      partial="[false|true]">
    (expression matching fields used to create the checksum)
  </fieldMatcher>
</documentChecksummer>
toField is ignored unless the keep attribute is set to true.
<documentChecksummer class="MD5DocumentChecksummer"/>
The above example uses the document body (default) to make the checksum.
Since 2.0.0, a self-closing <documentChecksummer/> tag without any
attributes is used to disable checksum generation.
Constructor and Description |
---|
MD5DocumentChecksummer() |
Modifier and Type | Method and Description |
---|---|
String | doCreateDocumentChecksum(Doc document) |
boolean | equals(Object other) |
TextMatcher | getFieldMatcher() Gets the field matcher. |
List<String> | getSourceFields() Deprecated. Since 2.0.0, use getFieldMatcher(). |
String | getSourceFieldsRegex() Deprecated. Since 2.0.0, use getFieldMatcher(). |
int | hashCode() |
boolean | isCombineFieldsAndContent() Gets whether we are combining the fields and content checksums. |
boolean | isDisabled() Deprecated. Since 2.0.0, not having a checksummer defined or setting one explicitly to null effectively disables it. |
protected void | loadChecksummerFromXML(XML xml) |
protected void | saveChecksummerToXML(XML xml) |
void | setCombineFieldsAndContent(boolean combineFieldsAndContent) Sets whether to combine the fields and content checksums. |
void | setDisabled(boolean disabled) Deprecated. Since 2.0.0, not having a checksummer defined or setting one explicitly to null effectively disables it. |
void | setFieldMatcher(TextMatcher fieldMatcher) Sets the field matcher. |
void | setSourceFields(List<String> sourceFields) Deprecated. Since 2.0.0, use setFieldMatcher(TextMatcher). |
void | setSourceFields(String... sourceFields) Deprecated. Since 2.0.0, use setFieldMatcher(TextMatcher). |
void | setSourceFieldsRegex(String sourceFieldsRegex) Deprecated. Since 2.0.0, use setFieldMatcher(TextMatcher). |
String | toString() |
createDocumentChecksum, getOnSet, getTargetField, getToField, isKeep, loadFromXML, saveToXML, setKeep, setOnSet, setTargetField, setToField
public String doCreateDocumentChecksum(Doc document)
doCreateDocumentChecksum in class AbstractDocumentChecksummer

public TextMatcher getFieldMatcher()

public void setFieldMatcher(TextMatcher fieldMatcher)
Parameters: fieldMatcher - field matcher

@Deprecated public List<String> getSourceFields()
Deprecated. Since 2.0.0, use getFieldMatcher().

@Deprecated public void setSourceFields(String... sourceFields)
Deprecated. Since 2.0.0, use setFieldMatcher(TextMatcher).
Parameters: sourceFields - fields to use to construct the checksum

@Deprecated public void setSourceFields(List<String> sourceFields)
Deprecated. Since 2.0.0, use setFieldMatcher(TextMatcher).
Parameters: sourceFields - fields to use to construct the checksum

@Deprecated public String getSourceFieldsRegex()
Deprecated. Since 2.0.0, use getFieldMatcher().

@Deprecated public void setSourceFieldsRegex(String sourceFieldsRegex)
Deprecated. Since 2.0.0, use setFieldMatcher(TextMatcher).
Parameters: sourceFieldsRegex - regular expression

@Deprecated public boolean isDisabled()
Deprecated. Since 2.0.0, not having a checksummer defined or setting one explicitly to null effectively disables it.
Returns: false

@Deprecated public void setDisabled(boolean disabled)
Deprecated. Since 2.0.0, not having a checksummer defined or setting one explicitly to null effectively disables it.
Parameters: disabled - argument is ignored

public boolean isCombineFieldsAndContent()
Returns: true if combining fields and content checksums

public void setCombineFieldsAndContent(boolean combineFieldsAndContent)
Parameters: combineFieldsAndContent - true if combining fields and content checksums

protected void loadChecksummerFromXML(XML xml)
loadChecksummerFromXML in class AbstractDocumentChecksummer

protected void saveChecksummerToXML(XML xml)
saveChecksummerToXML in class AbstractDocumentChecksummer

public boolean equals(Object other)
equals in class AbstractDocumentChecksummer

public int hashCode()
hashCode in class AbstractDocumentChecksummer

public String toString()
toString in class AbstractDocumentChecksummer
Copyright © 2014–2023 Norconex Inc. All rights reserved.