Class GenericURLNormalizer
- java.lang.Object
  - com.norconex.collector.http.url.impl.GenericURLNormalizer
- All Implemented Interfaces: IURLNormalizer, IXMLConfigurable
public class GenericURLNormalizer extends Object implements IURLNormalizer, IXMLConfigurable
Generic implementation of IURLNormalizer that should satisfy most URL normalization needs. This implementation relies on URLNormalizer. Please refer to it for complete documentation and examples.

This class is in effect by default. To skip its usage, you can explicitly set the URL Normalizer to null in the HttpCrawlerConfig, or you can disable it using setDisabled(boolean).

By default, this class removes the URL fragment and applies these RFC 3986 normalizations:
- Converting the scheme and host to lower case
- Capitalizing letters in escape sequences
- Decoding percent-encoded unreserved characters
- Removing the default port
- Encoding non-URI characters
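
To make the default rules concrete, here is a minimal sketch of most of them written against the standard java.net.URI class only. It is not the code used by GenericURLNormalizer or URLNormalizer; the class name DefaultNormalizationSketch and its methods are hypothetical. It assumes an absolute http or https URL with a host, ignores user info, and leaves out the "encoding non-URI characters" step.

```java
import java.net.URI;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex API.
public class DefaultNormalizationSketch {

    private static final Pattern ESCAPE = Pattern.compile("%([0-9a-fA-F]{2})");

    // Upper-cases the hex digits of each escape sequence and decodes
    // escapes that stand for RFC 3986 unreserved characters.
    public static String fixEscapes(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            boolean unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9')
                    || c == '-' || c == '.' || c == '_' || c == '~';
            String repl = unreserved
                    ? String.valueOf(c)
                    : "%" + m.group(1).toUpperCase(Locale.ROOT);
            m.appendReplacement(out, Matcher.quoteReplacement(repl));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Assumes an absolute http(s) URL with a host and no user info.
    public static String normalize(String url) {
        URI uri = URI.create(url);
        // Convert the scheme and host to lower case.
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        // Remove the default port (80 for http, 443 for https).
        int port = uri.getPort();
        boolean defaultPort = ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);
        StringBuilder b = new StringBuilder(scheme).append("://").append(host);
        if (port != -1 && !defaultPort) {
            b.append(':').append(port);
        }
        b.append(fixEscapes(uri.getRawPath()));
        if (uri.getRawQuery() != null) {
            b.append('?').append(fixEscapes(uri.getRawQuery()));
        }
        // The fragment is intentionally not appended (fragment removal).
        return b.toString();
    }
}
```

With this sketch, "HTTP://Www.Example.com:80/a%7eb/%2f?q=%c3%a9#top" becomes "http://www.example.com/a~b/%2F?q=%C3%A9": the escaped tilde is decoded because it is unreserved, while the escaped slash is only upper-cased.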
To override these defaults, specify a new list of normalizations to apply, via the setNormalizations(Normalization...) method or via XML configuration. Each normalization is identified by a code name. The following is the complete list of code names for supported normalizations. Refer to URLNormalizer for a full description of each:
- addDirectoryTrailingSlash (since 2.6.0)
- addDomainTrailingSlash (since 2.6.1)
- addWWW
- decodeUnreservedCharacters
- encodeNonURICharacters
- encodeSpaces
- lowerCase (since 2.9.0)
- lowerCasePath (since 2.9.0)
- lowerCaseQuery (since 2.9.0)
- lowerCaseQueryParameterNames (since 2.9.0)
- lowerCaseQueryParameterValues (since 2.9.0)
- lowerCaseSchemeHost
- removeDefaultPort
- removeDirectoryIndex
- removeDotSegments
- removeDuplicateSlashes
- removeEmptyParameters
- removeFragment
- removeQueryString (since 2.9.0)
- removeSessionIds
- removeTrailingFragment (since 3.1.0)
- removeTrailingQuestionMark
- removeTrailingSlash (since 2.6.0)
- removeTrailingHash (since 2.7.0)
- removeWWW
- replaceIPWithDomainName
- secureScheme
- sortQueryParameters
- unsecureScheme
- upperCaseEscapeSequence
In addition, this class allows you to specify any number of URL value replacements using regular expressions.
XML configuration usage:
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer" disabled="[false|true]">
  <normalizations>(normalization code names, comma separated)</normalizations>
  <replacements>
    <replace>
      <match>(regex pattern to match)</match>
      <replacement>(optional replacement value, default to blank)</replacement>
    </replace>
    (... repeat replace tag as needed ...)
  </replacements>
</urlNormalizer>
Since 2.7.2, an empty "normalizations" tag removes any normalization rules previously set (including the default ones). Omitting the tag altogether keeps the existing/default normalizations.
XML usage example:
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
    decodeUnreservedCharacters, removeDefaultPort,
    encodeNonURICharacters, addWWW
  </normalizations>
  <replacements>
    <replace>
      <match>&amp;view=print</match>
    </replace>
    <replace>
      <match>(&amp;type=)(summary)</match>
      <replacement>$1full</replacement>
    </replace>
  </replacements>
</urlNormalizer>
The example above adds a normalization that prepends "www." to URL domains when missing, on top of the default set of normalizations. It also adds custom URL "search-and-replace" rules that remove any "&view=print" strings from URLs and replace "&type=summary" with "&type=full".
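
The effect of the two replacement rules in the XML example can be approximated with plain java.util.regex semantics. The sketch below is an illustration under that assumption, not the actual GenericURLNormalizer code; ReplacementSketch, applyReplacements and exampleRules are hypothetical names. Note that "&amp;" in the XML stands for a literal "&" once the configuration is parsed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the Norconex API.
public class ReplacementSketch {

    // Applies each regex rule in insertion order. A null replacement
    // is treated as blank, mirroring the documented default.
    public static String applyReplacements(String url, Map<String, String> rules) {
        String result = url;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            String replacement = rule.getValue() == null ? "" : rule.getValue();
            result = result.replaceAll(rule.getKey(), replacement);
        }
        return result;
    }

    // The two rules from the XML usage example, with "&amp;" unescaped to "&".
    public static Map<String, String> exampleRules() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("&view=print", null);            // no replacement: strip the match
        rules.put("(&type=)(summary)", "$1full");  // keep group 1, swap the value
        return rules;
    }
}
```

Under these assumptions, "http://example.com/doc?id=1&view=print&type=summary" becomes "http://example.com/doc?id=1&type=full".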
- Author: Pascal Essiembre
Nested Class Summary
- static class GenericURLNormalizer.Normalization
- static class GenericURLNormalizer.Replace

Constructor Summary
- GenericURLNormalizer()

Method Summary
- boolean equals(Object other)
- List<GenericURLNormalizer.Normalization> getNormalizations()
- List<GenericURLNormalizer.Replace> getReplaces()
- int hashCode()
- boolean isDisabled(): Whether this URL Normalizer is disabled or not.
- void loadFromXML(XML xml)
- String normalizeURL(String url): Normalize the given URL.
- void saveToXML(XML xml)
- void setDisabled(boolean disabled): Sets whether this URL Normalizer is disabled or not.
- void setNormalizations(GenericURLNormalizer.Normalization... normalizations)
- void setNormalizations(List<GenericURLNormalizer.Normalization> normalizations)
- void setReplaces(GenericURLNormalizer.Replace... replaces)
- void setReplaces(List<GenericURLNormalizer.Replace> replaces)
- String toString()

Method Detail

normalizeURL
public String normalizeURL(String url)
Description copied from interface: IURLNormalizer
Normalize the given URL.
- Specified by: normalizeURL in interface IURLNormalizer
- Parameters: url - the URL to normalize
- Returns: the normalized URL

getNormalizations
public List<GenericURLNormalizer.Normalization> getNormalizations()

setNormalizations
public void setNormalizations(GenericURLNormalizer.Normalization... normalizations)

setNormalizations
public void setNormalizations(List<GenericURLNormalizer.Normalization> normalizations)

getReplaces
public List<GenericURLNormalizer.Replace> getReplaces()

setReplaces
public void setReplaces(GenericURLNormalizer.Replace... replaces)

setReplaces
public void setReplaces(List<GenericURLNormalizer.Replace> replaces)

isDisabled
public boolean isDisabled()
Whether this URL Normalizer is disabled or not.
- Returns: true if disabled
- Since: 2.3.0

setDisabled
public void setDisabled(boolean disabled)
Sets whether this URL Normalizer is disabled or not.
- Parameters: disabled - true if disabled
- Since: 2.3.0

loadFromXML
public void loadFromXML(XML xml)
- Specified by: loadFromXML in interface IXMLConfigurable

saveToXML
public void saveToXML(XML xml)
- Specified by: saveToXML in interface IXMLConfigurable