Class GenericURLNormalizer
- java.lang.Object
  - com.norconex.collector.http.url.impl.GenericURLNormalizer
- All Implemented Interfaces: IURLNormalizer, IXMLConfigurable
public class GenericURLNormalizer extends Object implements IURLNormalizer, IXMLConfigurable
Generic implementation of IURLNormalizer that should satisfy most URL normalization needs. This implementation relies on URLNormalizer. Please refer to it for complete documentation and examples.
This class is in effect by default. To skip its usage, you can explicitly set the URL Normalizer to null in the HttpCrawlerConfig, or you can disable it using setDisabled(boolean).
By default, this class removes the URL fragment and applies these RFC 3986 normalizations:
- Converting the scheme and host to lower case
- Capitalizing letters in escape sequences
- Decoding percent-encoded unreserved characters
- Removing the default port
- Encoding non-URI characters
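As a rough illustration of what some of these defaults do to a URL, here is a minimal standalone sketch using only the JDK. It is not the GenericURLNormalizer or URLNormalizer implementation: the class name DefaultNormalizationSketch and its methods are hypothetical, and it covers only fragment removal, lower-casing of scheme and host, default-port removal (assuming 80 for http and 443 for https), and upper-casing of escape sequences. Decoding unreserved characters and encoding non-URI characters are left out.

```java
import java.net.URI;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class DefaultNormalizationSketch {

    // Matches one percent-encoded octet such as %2f or %C3.
    private static final Pattern ESCAPE = Pattern.compile("%[0-9a-fA-F]{2}");

    // Upper-cases the hex digits of every escape sequence in the given text.
    static String upperCaseEscapes(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, m.group().toUpperCase(Locale.ROOT));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Assumes an absolute http/https URL with a plain host name.
    static String normalize(String url) {
        URI uri = URI.create(url);
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        boolean defaultPort = ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);

        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(host);
        if (port != -1 && !defaultPort) {
            sb.append(':').append(port);
        }
        if (uri.getRawPath() != null) {
            sb.append(upperCaseEscapes(uri.getRawPath()));
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(upperCaseEscapes(uri.getRawQuery()));
        }
        // The fragment is intentionally not appended.
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://Example.COM:80/a%2fb?q=%c3%a9#top"));
    }
}
```

With the input shown in main, the sketch prints http://example.com/a%2Fb?q=%C3%A9.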
To override this default, specify a new list of normalizations to apply, either via the setNormalizations(Normalization...) method or via XML configuration. Each normalization is identified by a code name. The following is the complete list of code names for supported normalizations. Refer to URLNormalizer for a full description of each:
- addDirectoryTrailingSlash (since 2.6.0)
- addDomainTrailingSlash (since 2.6.1)
- addWWW
- decodeUnreservedCharacters
- encodeNonURICharacters
- encodeSpaces
- lowerCase (since 2.9.0)
- lowerCasePath (since 2.9.0)
- lowerCaseQuery (since 2.9.0)
- lowerCaseQueryParameterNames (since 2.9.0)
- lowerCaseQueryParameterValues (since 2.9.0)
- lowerCaseSchemeHost
- removeDefaultPort
- removeDirectoryIndex
- removeDotSegments
- removeDuplicateSlashes
- removeEmptyParameters
- removeFragment
- removeQueryString (since 2.9.0)
- removeSessionIds
- removeTrailingFragment (since 3.1.0)
- removeTrailingQuestionMark
- removeTrailingSlash (since 2.6.0)
- removeTrailingHash (since 2.7.0)
- removeWWW
- replaceIPWithDomainName
- secureScheme
- sortQueryParameters
- unsecureScheme
- upperCaseEscapeSequence
In addition, this class allows you to specify any number of URL value replacements using regular expressions.
XML configuration usage:
  <urlNormalizer
      class="com.norconex.collector.http.url.impl.GenericURLNormalizer"
      disabled="[false|true]">
    <normalizations>(normalization code names, comma-separated)</normalizations>
    <replacements>
      <replace>
        <match>(regex pattern to match)</match>
        <replacement>(optional replacement value, defaults to blank)</replacement>
      </replace>
      (... repeat replace tag as needed ...)
    </replacements>
  </urlNormalizer>

Since 2.7.2, an empty "normalizations" tag effectively removes any normalization rules previously set (including the default ones). Omitting the tag altogether keeps the existing/default normalizations.
XML usage example:
  <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
    <normalizations>
      removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
      decodeUnreservedCharacters, removeDefaultPort,
      encodeNonURICharacters, addWWW
    </normalizations>
    <replacements>
      <replace>
        <match>&amp;view=print</match>
      </replace>
      <replace>
        <match>(&amp;type=)(summary)</match>
        <replacement>$1full</replacement>
      </replace>
    </replacements>
  </urlNormalizer>

The example above lists the default normalizations and adds one more, addWWW, which adds "www." to URL domains when missing. It also adds custom URL "search-and-replace" rules that remove any "&view=print" strings from URLs and replace "&type=summary" with "&type=full".
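To make the effect of the two replace rules in the XML usage example concrete, the following standalone sketch applies the same regular expressions with plain java.util.regex. It is only an approximation of the behavior described here, not the library's code: the class name ReplacementSketch and the sample URL are hypothetical, and it assumes that a missing replacement value means an empty string, as stated in the configuration usage. Note that "&amp;" in the XML is simply the escaped form of "&".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Norconex API.
final class ReplacementSketch {

    // Regex pattern -> replacement, in the order they appear in the XML example.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("&view=print", "");            // no <replacement>: assumed blank
        RULES.put("(&type=)(summary)", "$1full"); // $1 refers to the first group
    }

    // Applies every rule in order to the given URL.
    static String applyReplacements(String url) {
        String result = url;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            result = result.replaceAll(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(applyReplacements(
                "http://example.com/doc?id=7&type=summary&view=print"));
    }
}
```

For the sample URL in main, the sketch prints http://example.com/doc?id=7&type=full.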
- Author: Pascal Essiembre
Nested Class Summary
Nested Classes
- static class GenericURLNormalizer.Normalization
- static class GenericURLNormalizer.Replace
-
Constructor Summary
Constructors
- GenericURLNormalizer()
-
Method Summary
Methods (all are concrete instance methods)
- boolean equals(Object other)
- List<GenericURLNormalizer.Normalization> getNormalizations()
- List<GenericURLNormalizer.Replace> getReplaces()
- int hashCode()
- boolean isDisabled(): Whether this URL Normalizer is disabled or not.
- void loadFromXML(XML xml)
- String normalizeURL(String url): Normalize the given URL.
- void saveToXML(XML xml)
- void setDisabled(boolean disabled): Sets whether this URL Normalizer is disabled or not.
- void setNormalizations(GenericURLNormalizer.Normalization... normalizations)
- void setNormalizations(List<GenericURLNormalizer.Normalization> normalizations)
- void setReplaces(GenericURLNormalizer.Replace... replaces)
- void setReplaces(List<GenericURLNormalizer.Replace> replaces)
- String toString()
-
-
-
Method Detail
-
normalizeURL
public String normalizeURL(String url)
Description copied from interface: IURLNormalizer
Normalize the given URL.
- Specified by: normalizeURL in interface IURLNormalizer
- Parameters: url - the URL to normalize
- Returns: the normalized URL
-
getNormalizations
public List<GenericURLNormalizer.Normalization> getNormalizations()
-
setNormalizations
public void setNormalizations(GenericURLNormalizer.Normalization... normalizations)
-
setNormalizations
public void setNormalizations(List<GenericURLNormalizer.Normalization> normalizations)
-
getReplaces
public List<GenericURLNormalizer.Replace> getReplaces()
-
setReplaces
public void setReplaces(GenericURLNormalizer.Replace... replaces)
-
setReplaces
public void setReplaces(List<GenericURLNormalizer.Replace> replaces)
-
isDisabled
public boolean isDisabled()
Whether this URL Normalizer is disabled or not.
- Returns: true if disabled
- Since: 2.3.0
-
setDisabled
public void setDisabled(boolean disabled)
Sets whether this URL Normalizer is disabled or not.
- Parameters: disabled - true if disabled
- Since: 2.3.0
-
loadFromXML
public void loadFromXML(XML xml)
- Specified by: loadFromXML in interface IXMLConfigurable
-
saveToXML
public void saveToXML(XML xml)
- Specified by: saveToXML in interface IXMLConfigurable