Class URLNormalizer
- java.lang.Object
-
- com.norconex.commons.lang.url.URLNormalizer
-
- All Implemented Interfaces:
Serializable
public class URLNormalizer extends Object implements Serializable
The general idea behind URL normalization is to make different URLs "equivalent" (i.e. eliminate URL variations pointing to the same resource). To achieve this, URLNormalizer takes a URL and modifies it to its most basic or standard form (for the context in which it is used). Of course, URLNormalizer can simply be used as a generic URL manipulation tool for your needs.
You would typically "build" your normalized URL by invoking each method of interest, in the relevant order, using a similar approach:
String url = "Http://Example.com:80//foo/index.html";
URL normalizedURL = new URLNormalizer(url)
        .lowerCaseSchemeHost()
        .removeDefaultPort()
        .removeDuplicateSlashes()
        .removeDirectoryIndex()
        .addWWW()
        .toURL();
System.out.println(normalizedURL.toString());
// Output: http://www.example.com/foo/
Several of the normalization methods implemented come from the RFC 3986 standard. That standard and several more normalization techniques are well summarized in the Wikipedia article titled URL Normalization. This class implements most of the normalizations described in that article, plus a few additional ones, and borrows several of its examples.
The normalization methods available can be broken down into three categories:
Preserving Semantics
The following normalizations are part of the RFC 3986 standard and should result in equivalent URLs (ones that identify the same resource):
Convert scheme and host to lower case
Convert escape sequence to upper case
Decode percent-encoded unreserved characters
Removing default ports
URL-Encode non-ASCII characters
Encode spaces to plus sign
Usually Preserving Semantics
The following techniques will generate a semantically equivalent URL for the majority of use cases but are not enforced as a standard.
Not Preserving Semantics
These normalizations will fail to produce semantically equivalent URLs in many cases. They usually work best when you have a good understanding of the web site behind the supplied URL and know whether, for that site, the normalizations used can be considered to produce semantically equivalent URLs.
Remove directory index
Remove fragment (#)
Remove trailing fragment (#)
Replace IP with domain name
Unsecure schema (https → http)
Secure schema (http → https)
Remove duplicate slashes
Remove "www."
Add "www."
Sort query parameters
Remove empty query parameters
Remove trailing question mark (?)
Remove session IDs
Remove query string (since 1.15.1)
Convert entire URL lower case (since 1.15.1)
Convert URL path lower case (since 1.15.1)
Convert URL query string parameter name and values to lower case (since 1.15.1)
Convert URL query parameter names to lower case (since 1.15.1)
Convert URL query parameter values to lower case (since 1.15.1)
Refer to each method below for a description and examples.
- Author:
- Pascal Essiembre
- See Also:
- Serialized Form
-
-
Constructor Summary
URLNormalizer(String url)
Create a new URLNormalizer instance.
URLNormalizer(URL url)
Create a new URLNormalizer instance.
-
Method Summary
URLNormalizer addDirectoryTrailingSlash()
Adds a trailing slash (/) to a URL ending with a directory.
URLNormalizer addDomainTrailingSlash()
Adds a trailing slash (/) right after the domain for URLs with no path, before any fragment (#) or query string (?).
URLNormalizer addTrailingSlash()
Deprecated. Since 1.11.0, use addDirectoryTrailingSlash().
URLNormalizer addWWW()
Adds "www." domain name prefix.
URLNormalizer decodeUnreservedCharacters()
Decodes percent-encoded unreserved characters.
URLNormalizer encodeNonURICharacters()
Encodes all characters that are not supported characters in a URI (not to be confused with URL), as defined by the RFC 3986 standard.
URLNormalizer encodeSpaces()
Encodes space characters into plus signs (+) if they are part of the query string.
URLNormalizer lowerCase()
Converts the entire URL to lower case, including scheme, host name, path, query string parameter names and values.
URLNormalizer lowerCasePath()
Converts the URL path to lower case.
URLNormalizer lowerCaseQuery()
Converts the URL query string to lower case, which includes both the parameter names and values.
URLNormalizer lowerCaseQueryParameterNames()
Converts the URL query parameter names to lower case, leaving query parameter values intact.
URLNormalizer lowerCaseQueryParameterValues()
Converts the URL query parameter values to lower case, leaving query parameter names intact.
URLNormalizer lowerCaseSchemeHost()
Converts the scheme and host to lower case.
URLNormalizer removeDefaultPort()
Removes the default port (80 for http, and 443 for https).
URLNormalizer removeDirectoryIndex()
Removes directory index files.
URLNormalizer removeDotSegments()
Removes the unnecessary "." and ".." segments from the URL path.
URLNormalizer removeDuplicateSlashes()
Removes duplicate slashes.
URLNormalizer removeEmptyParameters()
Removes empty parameters.
URLNormalizer removeFragment()
Removes the URL fragment (from the first "#" character encountered to the end of the URL).
URLNormalizer removeQueryString()
Removes the URL query string (from the "?" character until the end or the first # character).
URLNormalizer removeSessionIds()
Removes a URL-based session id.
URLNormalizer removeTrailingFragment()
Removes the URL fragment like removeFragment(), but only if it is found after the last URL segment (/...).
URLNormalizer removeTrailingHash()
Removes trailing hash character ("#").
URLNormalizer removeTrailingQuestionMark()
Removes trailing question mark ("?").
URLNormalizer removeTrailingSlash()
Removes any trailing slash (/) from a URL, before fragment (#) or query string (?).
URLNormalizer removeWWW()
Removes "www." domain name prefix.
URLNormalizer replaceIPWithDomainName()
Replaces IP address with domain name.
URLNormalizer secureScheme()
Converts http scheme to https.
URLNormalizer sortQueryParameters()
Sorts query parameters.
String toString()
Returns the normalized URL as string.
URI toURI()
Returns the normalized URL as URI.
URL toURL()
Returns the normalized URL as URL.
URLNormalizer unsecureScheme()
Converts https scheme to http.
URLNormalizer upperCaseEscapeSequence()
Converts letters in URL-encoded escape sequences to upper case.
-
-
-
Constructor Detail
-
URLNormalizer
public URLNormalizer(URL url)
Create a new URLNormalizer instance.
- Parameters:
- url - the url to normalize
-
URLNormalizer
public URLNormalizer(String url)
Create a new URLNormalizer instance.
Since 1.8.0, spaces in URLs are no longer converted to + automatically. Use encodeNonURICharacters() or encodeSpaces().
- Parameters:
- url - the url to normalize
-
-
Method Detail
-
lowerCase
public URLNormalizer lowerCase()
Converts the entire URL to lower case, including scheme, host name, path, query string parameter names and values. Consider using less aggressive variations of lower case methods to only focus on specific parts of a URL.
HTTP://www.Example.com/Path/Query?Param1=AAA&Param2=BBB → http://www.example.com/path/query?param1=aaa&param2=bbb
- Returns:
- this instance
- Since:
- 1.15.1
-
lowerCaseSchemeHost
public URLNormalizer lowerCaseSchemeHost()
Converts the scheme and host to lower case.
HTTP://www.Example.com/ → http://www.example.com/
- Returns:
- this instance
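To make the transformation concrete, here is a minimal JDK-only sketch of the same idea. It is not the library's implementation: the class name SchemeHostLowerCaseSketch and the regular expression are assumptions made for illustration, and the sketch lower-cases the whole authority, so any user info in front of the host would be lower-cased as well.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of URLNormalizer.
final class SchemeHostLowerCaseSketch {

    // Group 1: "scheme://authority"; group 2: path, query and fragment.
    private static final Pattern SCHEME_HOST =
            Pattern.compile("^([^:/?#]+://[^/?#]*)(.*)$", Pattern.DOTALL);

    static String lowerCaseSchemeHost(String url) {
        Matcher m = SCHEME_HOST.matcher(url);
        if (!m.matches()) {
            return url; // not an absolute URL: leave untouched
        }
        // Only the scheme and authority change; the rest keeps its case.
        return m.group(1).toLowerCase(Locale.ROOT) + m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(lowerCaseSchemeHost("HTTP://www.Example.com/Path?Q=A"));
    }
}
```

Path and query are left alone because, unlike scheme and host, they may be case-sensitive on the target server.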
-
lowerCasePath
public URLNormalizer lowerCasePath()
Converts the URL path to lower case.
http://www.example.com/AAA/BBB → http://www.example.com/aaa/bbb
- Returns:
- this instance
- Since:
- 1.15.1
-
lowerCaseQuery
public URLNormalizer lowerCaseQuery()
Converts the URL query string to lower case, which includes both the parameter names and values.
http://www.example.com/query?Param1=AAA&Param2=BBB → http://www.example.com/query?param1=aaa&param2=bbb
- Returns:
- this instance
- Since:
- 1.15.1
-
lowerCaseQueryParameterNames
public URLNormalizer lowerCaseQueryParameterNames()
Converts the URL query parameter names to lower case, leaving query parameter values intact.
http://www.example.com/query?Param1=AAA&Param2=BBB → http://www.example.com/query?param1=AAA&param2=BBB
- Returns:
- this instance
- Since:
- 1.15.1
-
lowerCaseQueryParameterValues
public URLNormalizer lowerCaseQueryParameterValues()
Converts the URL query parameter values to lower case, leaving query parameter names intact.
http://www.example.com/query?Param1=AAA&Param2=BBB → http://www.example.com/query?Param1=aaa&Param2=bbb
- Returns:
- this instance
- Since:
- 1.15.1
-
upperCaseEscapeSequence
public URLNormalizer upperCaseEscapeSequence()
Converts letters in URL-encoded escape sequences to upper case.
http://www.example.com/a%c2%b1b → http://www.example.com/a%C2%B1b
- Returns:
- this instance
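As an illustration only, the following JDK-only sketch upper-cases the two hexadecimal digits of every percent-encoded triplet. The class name EscapeSequenceUpperCaseSketch is hypothetical and the code is an assumption about one reasonable approach, not the library's source.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of URLNormalizer.
final class EscapeSequenceUpperCaseSketch {

    // A percent sign followed by exactly two hexadecimal digits.
    private static final Pattern ESCAPE = Pattern.compile("%[0-9a-fA-F]{2}");

    static String upperCaseEscapeSequence(String url) {
        Matcher m = ESCAPE.matcher(url);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // The replacement contains no '$' or '\', so it is taken literally.
            m.appendReplacement(sb, m.group().toUpperCase(Locale.ROOT));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseEscapeSequence("http://www.example.com/a%c2%b1b"));
    }
}
```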
-
decodeUnreservedCharacters
public URLNormalizer decodeUnreservedCharacters()
Decodes percent-encoded unreserved characters.
http://www.example.com/%7Eusername/ → http://www.example.com/~username/
- Returns:
- this instance
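RFC 3986 defines the unreserved characters as letters, digits, hyphen, period, underscore and tilde. The sketch below shows, under that definition, how only those characters could be decoded while reserved ones such as %2F stay encoded. The class UnreservedDecodeSketch is a hypothetical illustration, not the library's code.

```java
// Hypothetical helper, not part of URLNormalizer.
final class UnreservedDecodeSketch {

    // Unreserved set from RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    static String decodeUnreserved(String url) {
        StringBuilder out = new StringBuilder(url.length());
        int i = 0;
        while (i < url.length()) {
            char c = url.charAt(i);
            if (c == '%' && i + 2 < url.length()) {
                int hi = Character.digit(url.charAt(i + 1), 16);
                int lo = Character.digit(url.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    char decoded = (char) (hi * 16 + lo);
                    if (isUnreserved(decoded)) {
                        out.append(decoded); // safe to decode
                        i += 3;
                        continue;
                    }
                }
            }
            out.append(c); // everything else is copied unchanged
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeUnreserved("http://www.example.com/%7Eusername/"));
    }
}
```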
-
encodeNonURICharacters
public URLNormalizer encodeNonURICharacters()
Encodes all characters that are not supported characters in a URI (not to be confused with URL), as defined by the RFC 3986 standard. This includes all non-ASCII characters.
Since this method also encodes spaces to the plus sign (+), there is no need to also invoke encodeSpaces().
http://www.example.com/^a [b]/ → http://www.example.com/%5Ea+%5Bb%5D/
- Returns:
- this instance
- Since:
- 1.8.0
-
encodeSpaces
public URLNormalizer encodeSpaces()
Encodes space characters into plus signs (+) if they are part of the query string. Spaces that are part of the URL path are percent-encoded to %20.
To encode all non-URI characters (including spaces), use encodeNonURICharacters() instead.
http://www.example.com/a b c → http://www.example.com/a%20b%20c
- Returns:
- this instance
- Since:
- 1.8.0
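The path/query distinction described above can be sketched with plain JDK string operations. This is an assumption about how such a rule could be written, not the library's implementation; the class name EncodeSpacesSketch is hypothetical, and for brevity everything after the first "?" (including any fragment) is treated as query string.

```java
// Hypothetical helper, not part of URLNormalizer.
final class EncodeSpacesSketch {

    static String encodeSpaces(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            // No query string: every space belongs to the path.
            return url.replace(" ", "%20");
        }
        String path = url.substring(0, q).replace(" ", "%20");
        String query = url.substring(q + 1).replace(" ", "+");
        return path + "?" + query;
    }

    public static void main(String[] args) {
        System.out.println(encodeSpaces("http://www.example.com/a b?c d=e f"));
    }
}
```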
-
removeDefaultPort
public URLNormalizer removeDefaultPort()
Removes the default port (80 for http, and 443 for https).
http://www.example.com:80/bar.html → http://www.example.com/bar.html
- Returns:
- this instance
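The rule "80 for http, 443 for https" could be expressed as follows. This is a simplified JDK-only sketch with a hypothetical class name, DefaultPortSketch; it assumes a plain host without user info or an IPv6 literal, and leaves any URL it cannot parse unchanged.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of URLNormalizer.
final class DefaultPortSketch {

    // Group 1: scheme; group 2: "://host"; group 3: port; group 4: the rest, if any.
    private static final Pattern WITH_PORT = Pattern.compile(
            "^(https?)(://[^/?#:]*):(\\d+)([/?#].*)?$",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String removeDefaultPort(String url) {
        Matcher m = WITH_PORT.matcher(url);
        if (!m.matches()) {
            return url; // no explicit port, or not http/https
        }
        String scheme = m.group(1).toLowerCase(Locale.ROOT);
        String port = m.group(3);
        boolean isDefault = ("http".equals(scheme) && "80".equals(port))
                || ("https".equals(scheme) && "443".equals(port));
        if (!isDefault) {
            return url;
        }
        String rest = m.group(4) == null ? "" : m.group(4);
        return m.group(1) + m.group(2) + rest;
    }

    public static void main(String[] args) {
        System.out.println(removeDefaultPort("http://www.example.com:80/bar.html"));
    }
}
```

Note that a port is only dropped when it is the default for the scheme actually used: port 80 on an https URL is kept.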
-
addDirectoryTrailingSlash
public URLNormalizer addDirectoryTrailingSlash()
Adds a trailing slash (/) to a URL ending with a directory. A URL is considered to end with a directory if the last path segment, before fragment (#) or query string (?), does not contain a dot, typically representing an extension.
Please Note: URLs do not always denote a directory structure and many URLs can qualify for this method without truly representing a directory. Adding a trailing slash to these URLs could potentially break their semantic equivalence.
http://www.example.com/alice → http://www.example.com/alice/
- Returns:
- this instance
- Since:
- 1.11.0 (renamed from "addTrailingSlash")
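The "last path segment does not contain a dot" heuristic can be sketched like this. The class DirectoryTrailingSlashSketch is hypothetical and the code is only an assumption about one way to apply the rule; in particular, it leaves URLs with no path at all untouched, on the assumption that addDomainTrailingSlash() covers that case.

```java
// Hypothetical helper, not part of URLNormalizer.
final class DirectoryTrailingSlashSketch {

    static String addDirectoryTrailingSlash(String url) {
        int cut = indexOfQueryOrFragment(url);
        String head = url.substring(0, cut); // scheme, host and path
        String tail = url.substring(cut);    // "?..." and/or "#...", possibly empty

        int schemeEnd = head.indexOf("://");
        int pathStart = head.indexOf('/', schemeEnd < 0 ? 0 : schemeEnd + 3);
        if (pathStart < 0) {
            return url; // no path at all
        }
        String lastSegment = head.substring(head.lastIndexOf('/') + 1);
        if (lastSegment.isEmpty() || lastSegment.contains(".")) {
            return url; // already ends with "/" or looks like a file
        }
        return head + "/" + tail;
    }

    // Index of the first '?' or '#', or the URL length if there is none.
    static int indexOfQueryOrFragment(String url) {
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '?' || c == '#') {
                return i;
            }
        }
        return url.length();
    }

    public static void main(String[] args) {
        System.out.println(addDirectoryTrailingSlash("http://www.example.com/alice?a=b"));
    }
}
```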
-
addDomainTrailingSlash
public URLNormalizer addDomainTrailingSlash()
Adds a trailing slash (/) right after the domain for URLs with no path, before any fragment (#) or query string (?).
Please Note: Adding a trailing slash to URLs could potentially break their semantic equivalence.
http://www.example.com → http://www.example.com/
- Returns:
- this instance
- Since:
- 1.12.0
-
addTrailingSlash
@Deprecated public URLNormalizer addTrailingSlash()
Deprecated. Since 1.11.0, use addDirectoryTrailingSlash().
Adds a trailing slash (/) to a URL ending with a directory. A URL is considered to end with a directory if the last path segment, before fragment (#) or query string (?), does not contain a dot, typically representing an extension.
Please Note: URLs do not always denote a directory structure and many URLs can qualify for this method without truly representing a directory. Adding a trailing slash to these URLs could potentially break their semantic equivalence.
http://www.example.com/alice → http://www.example.com/alice/
- Returns:
- this instance
-
removeTrailingSlash
public URLNormalizer removeTrailingSlash()
Removes any trailing slash (/) from a URL, before fragment (#) or query string (?).
Please Note: Removing trailing slashes from URLs could potentially break their semantic equivalence.
http://www.example.com/alice/ → http://www.example.com/alice
- Returns:
- this instance
- Since:
- 1.11.0
-
removeDotSegments
public URLNormalizer removeDotSegments()
Removes the unnecessary "." and ".." segments from the URL path.
As of 2.3.0, the algorithm used to remove the dot segments is the one prescribed by RFC 3986.
http://www.example.com/../a/b/../c/./d.html → http://www.example.com/a/c/d.html
Please Note: URLs do not always represent a clean hierarchy structure and the dots/double-dots may have a different meaning on some sites. Removing them from a URL could potentially break its semantic equivalence.
- Returns:
- this instance
- See Also:
URI.normalize()
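For reference, the remove_dot_segments procedure of RFC 3986 (section 5.2.4) works with an input buffer and an output buffer. The sketch below is a direct, unoptimized transcription of those steps applied to a path only; the class name DotSegmentsSketch is hypothetical and this is not the library's source.

```java
// Hypothetical helper, not part of URLNormalizer.
final class DotSegmentsSketch {

    static String removeDotSegments(String path) {
        StringBuilder in = new StringBuilder(path);
        StringBuilder out = new StringBuilder();
        while (in.length() > 0) {
            String s = in.toString();
            if (s.startsWith("../")) {
                in.delete(0, 3);                  // step A: drop leading "../"
            } else if (s.startsWith("./")) {
                in.delete(0, 2);                  // step A: drop leading "./"
            } else if (s.startsWith("/./")) {
                in.delete(0, 2);                  // step B: "/./" becomes "/"
            } else if (s.equals("/.")) {
                in.replace(0, 2, "/");            // step B: "/." becomes "/"
            } else if (s.startsWith("/../")) {
                in.delete(0, 3);                  // step C: "/../" becomes "/"
                removeLastSegment(out);
            } else if (s.equals("/..")) {
                in.replace(0, 3, "/");            // step C: "/.." becomes "/"
                removeLastSegment(out);
            } else if (s.equals(".") || s.equals("..")) {
                in.setLength(0);                  // step D: lone dot segments vanish
            } else {
                // Step E: move the first segment (with its leading "/") to the output.
                int next = s.indexOf('/', 1);
                if (next < 0) {
                    next = s.length();
                }
                out.append(s, 0, next);
                in.delete(0, next);
            }
        }
        return out.toString();
    }

    // Removes the last segment and its preceding "/" from the output buffer.
    private static void removeLastSegment(StringBuilder out) {
        int i = out.lastIndexOf("/");
        out.setLength(i < 0 ? 0 : i);
    }

    public static void main(String[] args) {
        System.out.println(removeDotSegments("/../a/b/../c/./d.html"));
    }
}
```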
-
removeDirectoryIndex
public URLNormalizer removeDirectoryIndex()
Removes directory index files. They are often not needed in URLs.
http://www.example.com/a/index.html → http://www.example.com/a/
Index files must be the last URL path segment to be considered. The following are considered index files:
- index.html
- index.htm
- index.shtml
- index.php
- default.html
- default.htm
- home.html
- home.htm
- index.php5
- index.php4
- index.php3
- index.cgi
- placeholder.html
- default.asp
Please Note: There are no guarantees a URL without its index files will be semantically equivalent, or even be valid.
- Returns:
- this instance
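A minimal sketch of the "last path segment is an index file" check, using the file names listed above. The class DirectoryIndexSketch is hypothetical; the case-sensitive match and the handling of a query string or fragment after the index file are assumptions, not confirmed library behavior.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of URLNormalizer.
final class DirectoryIndexSketch {

    private static final Set<String> INDEX_FILES = new HashSet<>(Arrays.asList(
            "index.html", "index.htm", "index.shtml", "index.php",
            "default.html", "default.htm", "home.html", "home.htm",
            "index.php5", "index.php4", "index.php3", "index.cgi",
            "placeholder.html", "default.asp"));

    static String removeDirectoryIndex(String url) {
        // Locate the end of the path (first '?' or '#', if any).
        int cut = url.length();
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '?' || c == '#') {
                cut = i;
                break;
            }
        }
        String head = url.substring(0, cut);
        int slash = head.lastIndexOf('/');
        if (slash < 0) {
            return url;
        }
        String lastSegment = head.substring(slash + 1);
        if (!INDEX_FILES.contains(lastSegment)) {
            return url; // assumption: names are matched case-sensitively
        }
        // Keep the trailing slash, then re-attach query string and fragment.
        return head.substring(0, slash + 1) + url.substring(cut);
    }

    public static void main(String[] args) {
        System.out.println(removeDirectoryIndex("http://www.example.com/a/index.html"));
    }
}
```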
-
removeFragment
public URLNormalizer removeFragment()
Removes the URL fragment (from the first "#" character encountered to the end of the URL).
http://www.example.com/abc.html#section1 → http://www.example.com/abc.html
http://www.example.com/abc#/def/ghi → http://www.example.com/abc
http://www.example.com/abc#def/ghi#klm → http://www.example.com/abc
- Returns:
- this instance
-
removeTrailingFragment
public URLNormalizer removeTrailingFragment()
Removes the URL fragment like removeFragment(), but only if it is found after the last URL segment (/...).
http://www.example.com/abc.html#section1 → http://www.example.com/abc.html
http://www.example.com/abc#/def/ghi → http://www.example.com/abc#/def/ghi
http://www.example.com/abc#def/ghi#klm → http://www.example.com/abc#def/ghi
- Returns:
- this instance
- Since:
- 2.1.0
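The difference between removeFragment() and removeTrailingFragment() is easiest to see side by side. The sketch below reproduces the documented examples with plain string operations; the class FragmentSketch is hypothetical, and the rule used for the trailing variant (strip from the last "#" only when no "/" follows it) is an interpretation of those examples rather than the library's code.

```java
// Hypothetical helper, not part of URLNormalizer.
final class FragmentSketch {

    // Strips everything from the first '#' onward.
    static String removeFragment(String url) {
        int i = url.indexOf('#');
        return i < 0 ? url : url.substring(0, i);
    }

    // Strips from the last '#' onward, unless a '/' appears after it.
    static String removeTrailingFragment(String url) {
        int i = url.lastIndexOf('#');
        if (i < 0 || url.indexOf('/', i) >= 0) {
            return url;
        }
        return url.substring(0, i);
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/abc#def/ghi#klm";
        System.out.println(removeFragment(url));
        System.out.println(removeTrailingFragment(url));
    }
}
```

The trailing variant is useful for sites where "#/..." carries routing information that must survive normalization.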
-
removeQueryString
public URLNormalizer removeQueryString()
Removes the URL query string (from the "?" character until the end or the first # character).
http://www.example.com/query?param1=AAA7&param2=BBB#fragment → http://www.example.com/query#fragment
- Returns:
- this instance
- Since:
- 1.15.1
-
replaceIPWithDomainName
public URLNormalizer replaceIPWithDomainName()
Replaces IP address with domain name. This is often not reliable due to virtual domain names and can be slow, as it has to access the network.
http://208.77.188.166/ → http://www.example.com/
- Returns:
- this instance
-
unsecureScheme
public URLNormalizer unsecureScheme()
Converts https scheme to http.
https://www.example.com/ → http://www.example.com/
- Returns:
- this instance
-
secureScheme
public URLNormalizer secureScheme()
Converts http scheme to https.
http://www.example.com/ → https://www.example.com/
- Returns:
- this instance
-
removeDuplicateSlashes
public URLNormalizer removeDuplicateSlashes()
Removes duplicate slashes. Two or more adjacent slash ("/") characters will be converted into one.
http://www.example.com/foo//bar.html → http://www.example.com/foo/bar.html
- Returns:
- this instance
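Collapsing slashes has to spare the "//" that follows the scheme. The sketch below shows one way to do that; the class DuplicateSlashesSketch is hypothetical, and leaving the query string and fragment untouched is an assumption of this sketch, not a statement about the library.

```java
// Hypothetical helper, not part of URLNormalizer.
final class DuplicateSlashesSketch {

    static String removeDuplicateSlashes(String url) {
        // End of the path: first '?' or '#', or the end of the URL.
        int end = url.length();
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '?' || c == '#') {
                end = i;
                break;
            }
        }
        // Skip "scheme://" so its double slash is preserved.
        int schemeEnd = url.indexOf("://");
        int start = (schemeEnd < 0 || schemeEnd > end) ? 0 : schemeEnd + 3;

        String collapsed = url.substring(start, end).replaceAll("/{2,}", "/");
        return url.substring(0, start) + collapsed + url.substring(end);
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicateSlashes("http://www.example.com/foo//bar.html"));
    }
}
```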
-
removeWWW
public URLNormalizer removeWWW()
Removes "www." domain name prefix.
http://www.example.com/ → http://example.com/
- Returns:
- this instance
-
addWWW
public URLNormalizer addWWW()
Adds "www." domain name prefix.
http://example.com/ → http://www.example.com/
- Returns:
- this instance
-
sortQueryParameters
public URLNormalizer sortQueryParameters()
Sorts query parameters.
http://www.example.com/?z=bb&y=cc&z=aa → http://www.example.com/?y=cc&z=bb&z=aa
- Returns:
- this instance
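In the example above, the two "z" parameters keep their original relative order, which suggests a stable sort on the parameter name only. The following sketch implements that reading with the JDK's stable List.sort; the class SortQuerySketch is hypothetical and the sort key is an assumption inferred from the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of URLNormalizer.
final class SortQuerySketch {

    static String sortQueryParameters(String url) {
        int hash = url.indexOf('#');
        int end = hash < 0 ? url.length() : hash;
        int q = url.indexOf('?');
        if (q < 0 || q > end) {
            return url; // no query string before the fragment
        }
        String query = url.substring(q + 1, end);
        if (query.isEmpty()) {
            return url;
        }
        List<String> params = new ArrayList<>(Arrays.asList(query.split("&")));
        // Stable sort: parameters with equal names keep their original order.
        params.sort(Comparator.comparing(SortQuerySketch::nameOf));
        return url.substring(0, q + 1) + String.join("&", params) + url.substring(end);
    }

    // Parameter name: the text before the first '=', or the whole token.
    static String nameOf(String param) {
        int eq = param.indexOf('=');
        return eq < 0 ? param : param.substring(0, eq);
    }

    public static void main(String[] args) {
        System.out.println(sortQueryParameters("http://www.example.com/?z=bb&y=cc&z=aa"));
    }
}
```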
-
removeEmptyParameters
public URLNormalizer removeEmptyParameters()
Removes empty parameters.
http://www.example.com/display?a=b&a=&c=d&e=&f=g → http://www.example.com/display?a=b&c=d&f=g
- Returns:
- this instance
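A possible reading of "empty parameter" is one with nothing after the equals sign. The sketch below applies that rule and, as an additional assumption, also drops tokens that have no equals sign at all. The class EmptyParametersSketch is hypothetical and not taken from the library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of URLNormalizer.
final class EmptyParametersSketch {

    static String removeEmptyParameters(String url) {
        int hash = url.indexOf('#');
        int end = hash < 0 ? url.length() : hash;
        int q = url.indexOf('?');
        if (q < 0 || q > end) {
            return url; // no query string before the fragment
        }
        List<String> kept = new ArrayList<>();
        for (String param : url.substring(q + 1, end).split("&")) {
            int eq = param.indexOf('=');
            // Keep only "name=value" tokens with a non-empty value.
            if (eq >= 0 && eq < param.length() - 1) {
                kept.add(param);
            }
        }
        return url.substring(0, q + 1) + String.join("&", kept) + url.substring(end);
    }

    public static void main(String[] args) {
        System.out.println(removeEmptyParameters(
                "http://www.example.com/display?a=b&a=&c=d&e=&f=g"));
    }
}
```

If every parameter is removed, this sketch leaves a bare "?" behind, which removeTrailingQuestionMark() is meant to clean up.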
-
removeTrailingQuestionMark
public URLNormalizer removeTrailingQuestionMark()
Removes trailing question mark ("?").
http://www.example.com/display? → http://www.example.com/display
- Returns:
- this instance
-
removeSessionIds
public URLNormalizer removeSessionIds()
Removes a URL-based session id. It removes PHP (PHPSESSID), ASP (ASPSESSIONID), and Java EE (jsessionid) session ids.
http://www.example.com/servlet;jsessionid=1E6FEC0D14D044541DD84D2D013D29ED?a=b → http://www.example.com/servlet?a=b
Please Note: Removing session IDs from URLs is often a good way to have the URL return an error once invoked.
- Returns:
- this instance
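The three session id styles named above appear in different places: jsessionid is a path parameter introduced by ";", while PHPSESSID and ASPSESSIONID are treated here as query parameters. The sketch below handles both positions. It is a simplified illustration with a hypothetical class name, SessionIdSketch; the exact matching rules (case-insensitive names, any suffix after ASPSESSIONID) are assumptions rather than the library's behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of URLNormalizer.
final class SessionIdSketch {

    // ";jsessionid=..." up to the next ';', '?' or '#'.
    private static final Pattern JSESSIONID =
            Pattern.compile("(?i);jsessionid=[^?#;]*");

    static String removeSessionIds(String url) {
        String out = JSESSIONID.matcher(url).replaceAll("");

        int hash = out.indexOf('#');
        int end = hash < 0 ? out.length() : hash;
        int q = out.indexOf('?');
        if (q < 0 || q > end) {
            return out; // no query string to inspect
        }
        List<String> kept = new ArrayList<>();
        for (String param : out.substring(q + 1, end).split("&")) {
            int eq = param.indexOf('=');
            String name = (eq < 0 ? param : param.substring(0, eq))
                    .toUpperCase(Locale.ROOT);
            if (!name.equals("PHPSESSID") && !name.startsWith("ASPSESSIONID")) {
                kept.add(param);
            }
        }
        // Drop the "?" as well when no parameter is left.
        String query = kept.isEmpty() ? "" : "?" + String.join("&", kept);
        return out.substring(0, q) + query + out.substring(end);
    }

    public static void main(String[] args) {
        System.out.println(removeSessionIds(
                "http://www.example.com/servlet;jsessionid=1E6FEC0D14D044541DD84D2D013D29ED?a=b"));
    }
}
```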
-
removeTrailingHash
public URLNormalizer removeTrailingHash()
Removes trailing hash character ("#").
http://www.example.com/path# → http://www.example.com/path
This only removes the hash character if it is the last character. To remove an entire URL fragment, use removeFragment().
- Returns:
- this instance
- Since:
- 1.13.0
-
toString
public String toString()
Returns the normalized URL as string.
-
-