public class GenericURLNormalizer extends Object implements IURLNormalizer, IXMLConfigurable
Generic implementation of IURLNormalizer that should satisfy
most URL normalization needs. This implementation relies on
URLNormalizer. Please refer to it for complete documentation and
examples.
This class is in effect by default. To skip its usage, you
can explicitly set the URL Normalizer to null in the
HttpCrawlerConfig, or you can disable it using
setDisabled(boolean).
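By default, this class removes the URL fragment and applies these RFC 3986 normalizations:
- lowerCaseSchemeHost
- upperCaseEscapeSequence
- decodeUnreservedCharacters
- removeDefaultPort
- encodeNonURICharacters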
To override this default, specify a new list of normalizations
to apply, via the setNormalizations(Normalization...) method
or via XML configuration. Each
normalization is identified by a code name. The following is the
complete list of code names for supported normalizations. Refer to
URLNormalizer for a full description of each:
- addDirectoryTrailingSlash (since 2.6.0)
- addDomainTrailingSlash (since 2.6.1)
- addWWW
- decodeUnreservedCharacters
- encodeNonURICharacters
- encodeSpaces
- lowerCaseSchemeHost
- removeDefaultPort
- removeDirectoryIndex
- removeDotSegments
- removeDuplicateSlashes
- removeEmptyParameters
- removeFragment
- removeSessionIds
- removeTrailingQuestionMark
- removeTrailingSlash (since 2.6.0)
- removeTrailingHash (since 2.7.0)
- removeWWW
- replaceIPWithDomainName
- secureScheme
- sortQueryParameters
- unsecureScheme
- upperCaseEscapeSequence
- removeQueryString (since 2.9.0)
- lowerCase (since 2.9.0)
- lowerCasePath (since 2.9.0)
- lowerCaseQuery (since 2.9.0)
- lowerCaseQueryParameterNames (since 2.9.0)
- lowerCaseQueryParameterValues (since 2.9.0)

In addition, this class allows you to specify any number of URL value replacements using regular expressions.
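To make a few of these code names concrete, here is a minimal JDK-only sketch that approximates removeFragment, lowerCaseSchemeHost and removeDefaultPort. It is an illustration under assumptions, not the implementation found in URLNormalizer: the class and method names are hypothetical, and the default ports (80 for http, 443 for https) are assumed.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration only; NOT the code used by GenericURLNormalizer.
final class DefaultNormalizationSketch {

    private DefaultNormalizationSketch() {
    }

    /**
     * Approximates three normalizations on an absolute URL:
     * removeFragment, lowerCaseSchemeHost and removeDefaultPort.
     * URLs without a scheme or host are returned unchanged.
     */
    static String normalize(String url) {
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getHost() == null) {
            return url;
        }
        // lowerCaseSchemeHost: only scheme and host are case-insensitive.
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // removeDefaultPort: assumed defaults are 80 (http) and 443 (https).
        int port = uri.getPort();
        boolean defaultPort = ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);

        StringBuilder out = new StringBuilder(scheme).append("://");
        if (uri.getRawUserInfo() != null) {
            out.append(uri.getRawUserInfo()).append('@');
        }
        out.append(host);
        if (port != -1 && !defaultPort) {
            out.append(':').append(port);
        }
        // Raw path and query are copied as-is to avoid re-encoding them.
        if (uri.getRawPath() != null) {
            out.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        // removeFragment: the fragment is simply never appended.
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://Example.COM:80/a/b?x=1#top"));
    }
}
```

Only the scheme and host are lower-cased here because the path and query of a URL can be case-sensitive.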
<urlNormalizer
    class="com.norconex.collector.http.url.impl.GenericURLNormalizer"
    disabled="[false|true]">
  <normalizations>
    (normalization code names, comma separated)
  </normalizations>
  <replacements>
    <replace>
      <match>(regex pattern to match)</match>
      <replacement>(optional replacement value, defaults to blank)</replacement>
    </replace>
    (... repeat replace tag as needed ...)
  </replacements>
</urlNormalizer>
Since 2.7.2, an empty "normalizations" tag removes any normalization rules previously set (including the default ones). Omitting the tag entirely keeps the existing/default normalizations.
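The difference between an empty tag and a missing tag can be modeled with the short sketch below. It is a hypothetical model of the rule just described, not the library's loadFromXML code: the class, the method, and the convention of passing null for a missing tag are all assumptions made for illustration, and the DEFAULTS list is inferred from the configuration example in this documentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the "normalizations" tag semantics (since 2.7.2).
final class NormalizationsTagSketch {

    // Assumed default code names, inferred from the documentation's example.
    static final List<String> DEFAULTS = Arrays.asList(
            "removeFragment", "lowerCaseSchemeHost", "upperCaseEscapeSequence",
            "decodeUnreservedCharacters", "removeDefaultPort",
            "encodeNonURICharacters");

    private NormalizationsTagSketch() {
    }

    /**
     * @param current    code names currently set (for example, the defaults)
     * @param tagContent text of the tag, or null when the tag is absent
     * @return the code names in effect after loading the configuration
     */
    static List<String> resolve(List<String> current, String tagContent) {
        if (tagContent == null) {
            // Tag missing: keep existing/default normalizations.
            return current;
        }
        // Tag present: its content replaces whatever was set before.
        // An empty or blank tag therefore yields an empty list.
        List<String> names = new ArrayList<>();
        for (String token : tagContent.split(",")) {
            String name = token.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }
}
```

With this model, resolve(DEFAULTS, null) returns the defaults untouched, while resolve(DEFAULTS, "") returns an empty list.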
The following example adds the addWWW normalization, which adds "www." to URL domains when missing, to the default set of normalizations. It also adds custom URL "search-and-replace" rules that remove any "&view=print" strings from URLs and replace "&type=summary" with "&type=full".
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
    decodeUnreservedCharacters, removeDefaultPort,
    encodeNonURICharacters, addWWW
  </normalizations>
  <replacements>
    <replace><match>&amp;view=print</match></replace>
    <replace>
      <match>(&amp;type=)(summary)</match>
      <replacement>$1full</replacement>
    </replace>
  </replacements>
</urlNormalizer>
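The effect of the two replacements above can be previewed with plain Java regular expressions. This is a minimal sketch that assumes the replacements follow java.util.regex semantics (as the "$1" group reference suggests) and are applied in declaration order. The class and method names are hypothetical and the URL is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not the library's own replacement code.
final class ReplacementSketch {

    private ReplacementSketch() {
    }

    /** The two match/replacement pairs from the XML example, in order. */
    static Map<String, String> exampleRules() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("&view=print", "");             // no replacement: blank
        rules.put("(&type=)(summary)", "$1full"); // keep group 1, swap value
        return rules;
    }

    /** Applies each regex replacement to the URL, in insertion order. */
    static String applyReplacements(String url, Map<String, String> rules) {
        String result = url;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            result = result.replaceAll(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/page?id=1&view=print&type=summary";
        // prints http://www.example.com/page?id=1&type=full
        System.out.println(applyReplacements(url, exampleRules()));
    }
}
```

Because match values are regular expressions, characters such as "?" or "." in a pattern need escaping when they are meant literally.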
| Modifier and Type | Class and Description |
|---|---|
| static class | GenericURLNormalizer.Normalization |
| static class | GenericURLNormalizer.Replace |
| Constructor and Description |
|---|
| GenericURLNormalizer() |
| Modifier and Type | Method and Description |
|---|---|
| boolean | equals(Object other) |
| GenericURLNormalizer.Normalization[] | getNormalizations() |
| GenericURLNormalizer.Replace[] | getReplaces() |
| int | hashCode() |
| boolean | isDisabled(): Whether this URL Normalizer is disabled or not. |
| void | loadFromXML(Reader in) |
| String | normalizeURL(String url): Normalize the given URL. |
| void | saveToXML(Writer out) |
| void | setDisabled(boolean disabled): Sets whether this URL Normalizer is disabled or not. |
| void | setNormalizations(GenericURLNormalizer.Normalization... normalizations) |
| void | setReplaces(GenericURLNormalizer.Replace... replaces) |
| String | toString() |
public String normalizeURL(String url)
Description copied from interface: IURLNormalizer
Normalize the given URL.
Specified by: normalizeURL in interface IURLNormalizer
Parameters: url - the URL to normalize

public GenericURLNormalizer.Normalization[] getNormalizations()

public void setNormalizations(GenericURLNormalizer.Normalization... normalizations)

public GenericURLNormalizer.Replace[] getReplaces()

public void setReplaces(GenericURLNormalizer.Replace... replaces)

public boolean isDisabled()
Returns: true if disabled

public void setDisabled(boolean disabled)
Parameters: disabled - true if disabled

public void loadFromXML(Reader in)
Specified by: loadFromXML in interface IXMLConfigurable

public void saveToXML(Writer out) throws IOException
Specified by: saveToXML in interface IXMLConfigurable
Throws: IOException

Copyright © 2009–2021 Norconex Inc. All rights reserved.