public class GenericURLNormalizer extends Object implements IURLNormalizer, IXMLConfigurable
Generic implementation of IURLNormalizer that should satisfy most URL normalization needs. This implementation relies on URLNormalizer. Please refer to it for complete documentation and examples.
This class is in effect by default. To skip its usage, you can explicitly set the URL Normalizer to null in the HttpCrawlerConfig, or you can disable it using setDisabled(boolean).
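By default, this class removes the URL fragment (removeFragment) and applies these RFC 3986 normalizations:
lowerCaseSchemeHost
upperCaseEscapeSequence
decodeUnreservedCharacters
removeDefaultPort
encodeNonURICharacters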
To override this default, specify a new list of normalizations to apply, either via the setNormalizations(Normalization...) method or via XML configuration. Each normalization is identified by a code name. The following is the complete list of code names for supported normalizations. Refer to URLNormalizer for a full description of each:
addDirectoryTrailingSlash (since 2.6.0)
addDomainTrailingSlash (since 2.6.1)
addWWW
decodeUnreservedCharacters
encodeNonURICharacters
encodeSpaces
lowerCaseSchemeHost
removeDefaultPort
removeDirectoryIndex
removeDotSegments
removeDuplicateSlashes
removeEmptyParameters
removeFragment
removeSessionIds
removeTrailingQuestionMark
removeTrailingSlash (since 2.6.0)
removeTrailingHash (since 2.7.0)
removeWWW
replaceIPWithDomainName
secureScheme
sortQueryParameters
unsecureScheme
upperCaseEscapeSequence
removeQueryString (since 2.9.0)
lowerCase (since 2.9.0)
lowerCasePath (since 2.9.0)
lowerCaseQuery (since 2.9.0)
lowerCaseQueryParameterNames (since 2.9.0)
lowerCaseQueryParameterValues (since 2.9.0)
In addition, this class allows you to specify any number of URL value replacements using regular expressions.
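To make a few of the code names above concrete, here is a minimal JDK-only sketch of the effect of removeFragment, lowerCaseSchemeHost and removeDefaultPort. It does not use or reproduce the Norconex implementation: the class `DefaultNormalizationSketch` and its `normalize` helper are hypothetical names introduced for illustration, and the sketch assumes an absolute http or https URL with a host and no user info.

```java
import java.net.URI;
import java.util.Locale;

class DefaultNormalizationSketch {

    // Hypothetical helper (not part of Norconex): mimics the effect of
    // removeFragment, lowerCaseSchemeHost and removeDefaultPort.
    // Assumes an absolute URL with a host and no user info.
    static String normalize(String url) {
        URI uri = URI.create(url);

        // lowerCaseSchemeHost: scheme and host are case-insensitive.
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // removeDefaultPort: drop 80 for http and 443 for https.
        int port = uri.getPort(); // -1 when no port is given
        boolean defaultPort = ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);

        StringBuilder out = new StringBuilder(scheme).append("://").append(host);
        if (port != -1 && !defaultPort) {
            out.append(':').append(port);
        }
        if (uri.getRawPath() != null) {
            out.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        // removeFragment: the fragment is deliberately not appended.
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "http://example.com/a/b?x=1"
        System.out.println(normalize("HTTP://Example.COM:80/a/b?x=1#top"));
    }
}
```

The path and query are copied in their raw (still encoded) form so that the sketch changes nothing beyond the three normalizations it names.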
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer" disabled="[false|true]">
  <normalizations>
    (normalization code names, comma separated)
  </normalizations>
  <replacements>
    <replace>
      <match>(regex pattern to match)</match>
      <replacement>(optional replacement value, default to blank)</replacement>
    </replace>
    (... repeat replace tag as needed ...)
  </replacements>
</urlNormalizer>
Since 2.7.2, having an empty "normalizations" tag will effectively remove any normalization rules previously set (including the default ones). Not having the tag at all will keep the existing/default normalizations.
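The difference between an empty tag and a missing tag can be sketched as follows. This is only an illustration of the rule stated above, not the actual loading code: `NormalizationTagSketch`, `resolve` and the use of `null` to represent a missing tag are assumptions made for the example, and the code names are handled as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NormalizationTagSketch {

    // Hypothetical helper (not part of Norconex).
    // tagText is the text content of the "normalizations" tag,
    // or null when the tag is absent from the configuration.
    static List<String> resolve(String tagText, List<String> current) {
        if (tagText == null) {
            // Tag absent: keep existing/default normalizations.
            return current;
        }
        // Tag present: its content replaces whatever was set before,
        // so an empty tag yields an empty list.
        List<String> names = new ArrayList<>();
        for (String part : tagText.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> current = Arrays.asList("removeFragment", "removeDefaultPort");
        System.out.println(resolve(null, current)); // [removeFragment, removeDefaultPort]
        System.out.println(resolve("", current));   // []
    }
}
```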
The following example extends the default set of normalizations with one that adds "www." to URL domains when missing. It also adds custom URL "search-and-replace" rules that remove any "&view=print" strings from URLs and replace "&type=summary" with "&type=full".
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
    decodeUnreservedCharacters, removeDefaultPort,
    encodeNonURICharacters, addWWW
  </normalizations>
  <replacements>
    <replace><match>&amp;view=print</match></replace>
    <replace>
      <match>(&amp;type=)(summary)</match>
      <replacement>$1full</replacement>
    </replace>
  </replacements>
</urlNormalizer>
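The effect of the two replacements in this example can be tried out with plain `java.util.regex` semantics. The sketch below assumes that replacements behave like Java regular-expression replace-all operations applied in the order listed, with `$1` referring to the first capture group (the `$1full` value in the example is consistent with that). `ReplacementSketch` and `applyReplacements` are hypothetical names, not part of the Norconex API.

```java
import java.util.regex.Pattern;

class ReplacementSketch {

    // Each rule is {regex to match, replacement value}. An empty
    // replacement removes the matched text, mirroring the
    // "default to blank" behavior of an omitted replacement tag.
    static final String[][] RULES = {
        {"&view=print", ""},
        {"(&type=)(summary)", "$1full"},
    };

    // Hypothetical helper: applies every rule in order to the URL.
    static String applyReplacements(String url) {
        String result = url;
        for (String[] rule : RULES) {
            result = Pattern.compile(rule[0]).matcher(result).replaceAll(rule[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "http://example.com/doc?id=7&type=full"
        System.out.println(applyReplacements(
                "http://example.com/doc?id=7&view=print&type=summary"));
    }
}
```

In the XML form the ampersand has to be written as an entity, while in Java source the regular expression contains the literal "&" character.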
Modifier and Type | Class and Description |
---|---|
static class | GenericURLNormalizer.Normalization |
static class | GenericURLNormalizer.Replace |
Constructor and Description |
---|
GenericURLNormalizer() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
GenericURLNormalizer.Normalization[] | getNormalizations() |
GenericURLNormalizer.Replace[] | getReplaces() |
int | hashCode() |
boolean | isDisabled() Whether this URL Normalizer is disabled or not. |
void | loadFromXML(Reader in) |
String | normalizeURL(String url) Normalize the given URL. |
void | saveToXML(Writer out) |
void | setDisabled(boolean disabled) Sets whether this URL Normalizer is disabled or not. |
void | setNormalizations(GenericURLNormalizer.Normalization... normalizations) |
void | setReplaces(GenericURLNormalizer.Replace... replaces) |
String | toString() |
public String normalizeURL(String url)
Description copied from interface: IURLNormalizer
Normalize the given URL.
Specified by: normalizeURL in interface IURLNormalizer
Parameters: url - the URL to normalize

public GenericURLNormalizer.Normalization[] getNormalizations()

public void setNormalizations(GenericURLNormalizer.Normalization... normalizations)

public GenericURLNormalizer.Replace[] getReplaces()

public void setReplaces(GenericURLNormalizer.Replace... replaces)

public boolean isDisabled()
Whether this URL Normalizer is disabled or not.
Returns: true if disabled

public void setDisabled(boolean disabled)
Sets whether this URL Normalizer is disabled or not.
Parameters: disabled - true if disabled

public void loadFromXML(Reader in)
Specified by: loadFromXML in interface IXMLConfigurable

public void saveToXML(Writer out) throws IOException
Specified by: saveToXML in interface IXMLConfigurable
Throws: IOException
Copyright © 2009–2021 Norconex Inc. All rights reserved.