Class GenericURLNormalizer
- All Implemented Interfaces: IURLNormalizer, IXMLConfigurable
Generic implementation of IURLNormalizer that should satisfy
most URL normalization needs. This implementation relies on
URLNormalizer. Please refer to it for complete documentation and
examples.
This normalizer is used by default. To bypass it, either
explicitly set the URL normalizer to null in the
HttpCrawlerConfig, or disable it with
setDisabled(boolean).
By default, this class removes the URL fragment and applies these RFC 3986 normalizations:
- Converting the scheme and host to lower case
- Capitalizing letters in escape sequences
- Decoding percent-encoded unreserved characters
- Removing the default port
- Encoding non-URI characters
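
To make a few of these defaults concrete, here is a minimal plain-Java sketch. It is not the URLNormalizer implementation: the class name DefaultNormalizationSketch and its methods are hypothetical, it relies only on java.net.URI and java.util.regex, and it covers just four of the steps above (fragment removal, lower-casing the scheme and host, upper-casing escape sequences, and removing the default port). Decoding unreserved characters and encoding non-URI characters are left out.

```java
import java.net.URI;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Norconex API.
final class DefaultNormalizationSketch {

    // Matches a percent-encoded octet such as %2f or %3A.
    private static final Pattern ESCAPE = Pattern.compile("%[0-9a-fA-F]{2}");

    /** Upper-cases the hex digits of every percent-escape sequence. */
    static String upperCaseEscapes(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, m.group().toUpperCase(Locale.ROOT));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    /**
     * Drops the fragment, lower-cases scheme and host, upper-cases escape
     * sequences, and removes port 80 (http) or 443 (https).
     */
    static String normalize(String url) {
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getHost() == null) {
            return url; // not an absolute, server-based URL: leave untouched
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        boolean defaultPort = ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);

        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(host);
        if (port != -1 && !defaultPort) {
            sb.append(':').append(port);
        }
        if (uri.getRawPath() != null) {
            sb.append(upperCaseEscapes(uri.getRawPath()));
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(upperCaseEscapes(uri.getRawQuery()));
        }
        // The fragment is intentionally never appended.
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://Example.COM:80/a%2fb?x=%3a#frag"));
    }
}
```

With this sketch, the input HTTP://Example.COM:80/a%2fb?x=%3a#frag becomes http://example.com/a%2Fb?x=%3A, while a non-default port such as 8443 is preserved.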
To override these default normalizations, specify a new list of normalizations
to apply, via the setNormalizations(Normalization...) method,
or via XML configuration. Each
normalization is identified by a code name. The following is the
complete list of code names for supported normalizations. Refer to
URLNormalizer for a full description of each:
- addDirectoryTrailingSlash (since 2.6.0)
- addDomainTrailingSlash (since 2.6.1)
- addWWW
- decodeUnreservedCharacters
- encodeNonURICharacters
- encodeSpaces
- lowerCase (since 2.9.0)
- lowerCasePath (since 2.9.0)
- lowerCaseQuery (since 2.9.0)
- lowerCaseQueryParameterNames (since 2.9.0)
- lowerCaseQueryParameterValues (since 2.9.0)
- lowerCaseSchemeHost
- removeDefaultPort
- removeDirectoryIndex
- removeDotSegments
- removeDuplicateSlashes
- removeEmptyParameters
- removeFragment
- removeQueryString (since 2.9.0)
- removeSessionIds
- removeTrailingFragment (since 3.1.0)
- removeTrailingQuestionMark
- removeTrailingSlash (since 2.6.0)
- removeTrailingHash (since 2.7.0)
- removeWWW
- replaceIPWithDomainName
- secureScheme
- sortQueryParameters
- unsecureScheme
- upperCaseEscapeSequence
In addition, this class allows you to specify any number of URL value replacements using regular expressions.
XML configuration usage:
<urlNormalizer
    class="com.norconex.collector.http.url.impl.GenericURLNormalizer"
    disabled="[false|true]">
  <normalizations>(normalization code names, comma separated)</normalizations>
  <replacements>
    <replace>
      <match>(regex pattern to match)</match>
      <replacement>(optional replacement value, defaults to blank)</replacement>
    </replace>
    (... repeat replace tag as needed ...)
  </replacements>
</urlNormalizer>
Since 2.7.2, an empty "normalizations" tag removes any normalization rules previously set (including the default ones). Omitting the tag entirely keeps the existing/default normalizations.
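
The difference between an empty tag and a missing tag can be sketched in plain Java as follows. This is an illustration of the behavior described above, not the library's XML loading code: the class NormalizationsTagSketch, the method resolve, and the convention that a null string stands for an absent tag are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the Norconex API.
final class NormalizationsTagSketch {

    /**
     * Resolves the code names to apply.
     *
     * @param tagContent text of the "normalizations" tag, or null when the
     *                   tag is absent (an assumption of this sketch)
     * @param current    the normalizations currently set (e.g., the defaults)
     */
    static List<String> resolve(String tagContent, List<String> current) {
        if (tagContent == null) {
            return current; // tag absent: keep existing/default normalizations
        }
        List<String> names = new ArrayList<>();
        for (String token : tagContent.split(",")) {
            String name = token.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names; // empty or blank tag: no normalizations at all
    }

    public static void main(String[] args) {
        List<String> defaults = Arrays.asList("removeFragment", "lowerCaseSchemeHost");
        System.out.println(resolve(null, defaults));
        System.out.println(resolve("", defaults));
        System.out.println(resolve("removeFragment, addWWW", defaults));
    }
}
```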
XML usage example:
<urlNormalizer
    class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
    decodeUnreservedCharacters, removeDefaultPort,
    encodeNonURICharacters, addWWW
  </normalizations>
  <replacements>
    <replace>
      <match>&amp;view=print</match>
    </replace>
    <replace>
      <match>(&amp;type=)(summary)</match>
      <replacement>$1full</replacement>
    </replace>
  </replacements>
</urlNormalizer>
The example above extends the default set of normalizations with addWWW, which adds "www." to URL domains when missing. It also adds custom URL "search-and-replace" rules that remove any "&view=print" strings from URLs and replace "&type=summary" with "&type=full". In the XML, each "&" is written as "&amp;".
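
The effect of the two replacement rules can be reproduced with standard Java regular expressions. The sketch below is a stand-in, not the normalizer's own code: the class ReplacementSketch and its methods are hypothetical, and it assumes each rule behaves like String.replaceAll applied in declaration order, with a missing replacement value treated as an empty string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the Norconex API.
final class ReplacementSketch {

    /** The two rules from the XML example, with "&amp;" unescaped to "&". */
    static Map<String, String> exampleRules() {
        Map<String, String> rules = new LinkedHashMap<>(); // keeps rule order
        rules.put("&view=print", "");             // no replacement: remove match
        rules.put("(&type=)(summary)", "$1full"); // $1 re-inserts "&type="
        return rules;
    }

    /** Applies every regex rule in order to the given URL. */
    static String applyReplacements(String url, Map<String, String> rules) {
        String result = url;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            result = result.replaceAll(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String url = "http://example.com/doc?id=1&type=summary&view=print";
        System.out.println(applyReplacements(url, exampleRules()));
    }
}
```

Under these assumptions, http://example.com/doc?id=1&type=summary&view=print becomes http://example.com/doc?id=1&type=full.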
- Author: Pascal Essiembre
Nested Class Summary
- static enum GenericURLNormalizer.Normalization
- static class GenericURLNormalizer.Replace
Constructor Summary
- GenericURLNormalizer()
Method Summary
- boolean equals
- getNormalizations
- getReplaces
- int hashCode()
- boolean isDisabled(): Whether this URL Normalizer is disabled or not.
- void loadFromXML(XML xml)
- String normalizeURL(String url): Normalize the given URL.
- void saveToXML
- void setDisabled(boolean disabled): Sets whether this URL Normalizer is disabled or not.
- void setNormalizations(GenericURLNormalizer.Normalization... normalizations)
- void setNormalizations(List<GenericURLNormalizer.Normalization> normalizations)
- void setReplaces(GenericURLNormalizer.Replace... replaces)
- void setReplaces(List<GenericURLNormalizer.Replace> replaces)
- String toString()
Constructor Details
GenericURLNormalizer
public GenericURLNormalizer()
Method Details
normalizeURL
Description copied from interface: IURLNormalizer
Normalize the given URL.
- Specified by: normalizeURL in interface IURLNormalizer
- Parameters: url - the URL to normalize
- Returns: the normalized URL
getNormalizations
setNormalizations
setNormalizations
getReplaces
setReplaces
setReplaces
isDisabled
public boolean isDisabled()
Whether this URL Normalizer is disabled or not.
- Returns: true if disabled
- Since: 2.3.0
setDisabled
public void setDisabled(boolean disabled)
Sets whether this URL Normalizer is disabled or not.
- Parameters: disabled - true if disabled
- Since: 2.3.0
loadFromXML
- Specified by: loadFromXML in interface IXMLConfigurable
saveToXML
- Specified by: saveToXML in interface IXMLConfigurable
equals
hashCode
public int hashCode()
toString