URLNormalizer (Norconex Commons Lang 2.0.2 API)

java.lang.Object
- com.norconex.commons.lang.url.URLNormalizer

All Implemented Interfaces:

Serializable
```
public class URLNormalizer
extends Object
implements Serializable
```
The general idea behind URL normalization is to make different URLs "equivalent" (i.e. eliminate URL variations pointing to the same resource). To achieve this, URLNormalizer takes a URL and modifies it to its most basic or standard form (for the context in which it is used). Of course URLNormalizer can simply be used as a generic URL manipulation tool for your needs.

You would typically "build" your normalized URL by invoking each method of interest, in the relevant order, using an approach similar to this:
```
 String url = "Http://Example.com:80//foo/index.html";
 URL normalizedURL = new URLNormalizer(url)
         .lowerCaseSchemeHost()
         .removeDefaultPort()
         .removeDuplicateSlashes()
         .removeDirectoryIndex()
         .addWWW()
         .toURL();
 System.out.println(normalizedURL.toString());
 // Output: http://www.example.com/foo/
```
Several of the normalization methods implemented come from the RFC 3986 standard. These standards and several more normalization techniques are well summarized in the Wikipedia article titled URL Normalization. This class implements most normalizations described in that article and borrows several of its examples, as well as a few additional ones.

The normalization methods available can be broken down into three categories:

Preserving Semantics

The following normalizations are part of the RFC 3986 standard and should result in equivalent URLs (ones that identify the same resource):

Usually Preserving Semantics

The following techniques will generate a semantically equivalent URL for the majority of use cases but are not enforced as a standard.
- Add trailing slash
- Remove dot segments

Not Preserving Semantics

These normalizations will fail to produce semantically equivalent URLs in many cases. They usually work best when you have a good understanding of the web site behind the supplied URL and know whether, for that site, the normalizations used can be considered to produce semantically equivalent URLs.
- Remove directory index
- Remove fragment (#)
- Replace IP with domain name
- Unsecure scheme (https → http)
- Secure scheme (http → https)
- Remove duplicate slashes
- Remove "www."
- Add "www."
- Sort query parameters
- Remove empty query parameters
- Remove trailing question mark (?)
- Remove session IDs
- Remove query string (since 1.15.1)
- Convert entire URL lower case (since 1.15.1)
- Convert URL path lower case (since 1.15.1)
- Convert URL query string parameter name and values to lower case (since 1.15.1)
- Convert URL query parameter names to lower case (since 1.15.1)
- Convert URL query parameter values to lower case (since 1.15.1)
Refer to each method below for a description and examples.
Author:

Pascal Essiembre

See Also:

Serialized Form

Constructor Summary

Constructors
Constructor and Description
`URLNormalizer(String url)` Create a new `URLNormalizer` instance.
`URLNormalizer(URL url)` Create a new `URLNormalizer` instance.

Method Summary

Modifier and Type	Method and Description
`URLNormalizer`	`addDirectoryTrailingSlash()` Adds a trailing slash (/) to a URL ending with a directory.
`URLNormalizer`	`addDomainTrailingSlash()` Adds a trailing slash (/) right after the domain for URLs with no path, before any fragment (#) or query string (?).
`URLNormalizer`	`addTrailingSlash()` Deprecated. Since 1.11.0, use `addDirectoryTrailingSlash()`
`URLNormalizer`	`addWWW()` Adds "www." domain name prefix.
`URLNormalizer`	`decodeUnreservedCharacters()` Decodes percent-encoded unreserved characters.
`URLNormalizer`	`encodeNonURICharacters()` Encodes all characters that are not supported characters in a URI (not to be confused with a URL), as defined by the RFC 3986 standard.
`URLNormalizer`	`encodeSpaces()` Encodes space characters into plus signs (+) if they are part of the query string.
`URLNormalizer`	`lowerCase()` Converts the entire URL to lower case, including scheme, host name, path, query string parameter names and values.
`URLNormalizer`	`lowerCasePath()` Converts the URL path to lower case.
`URLNormalizer`	`lowerCaseQuery()` Converts the URL query string to lower case, which includes both the parameter names and values.
`URLNormalizer`	`lowerCaseQueryParameterNames()` Converts the URL query parameter names to lower case, leaving query parameter values intact.
`URLNormalizer`	`lowerCaseQueryParameterValues()` Converts the URL query parameter values to lower case, leaving query parameter names intact.
`URLNormalizer`	`lowerCaseSchemeHost()` Converts the scheme and host to lower case.
`URLNormalizer`	`removeDefaultPort()` Removes the default port (80 for http, and 443 for https).
`URLNormalizer`	`removeDirectoryIndex()` Removes directory index files.
`URLNormalizer`	`removeDotSegments()` Removes the unnecessary "." and ".." segments from the URL path.
`URLNormalizer`	`removeDuplicateSlashes()` Removes duplicate slashes.
`URLNormalizer`	`removeEmptyParameters()` Removes empty parameters.
`URLNormalizer`	`removeFragment()` Removes the URL fragment (from the "#" character until the end).
`URLNormalizer`	`removeQueryString()` Removes the URL query string (from the "?" character until the end or the first # character).
`URLNormalizer`	`removeSessionIds()` Removes a URL-based session id.
`URLNormalizer`	`removeTrailingHash()` Removes trailing hash character ("#").
`URLNormalizer`	`removeTrailingQuestionMark()` Removes trailing question mark ("?").
`URLNormalizer`	`removeTrailingSlash()` Removes any trailing slash (/) from a URL, before fragment (#) or query string (?).
`URLNormalizer`	`removeWWW()` Removes "www." domain name prefix.
`URLNormalizer`	`replaceIPWithDomainName()` Replaces IP address with domain name.
`URLNormalizer`	`secureScheme()` Converts `http` scheme to `https`.
`URLNormalizer`	`sortQueryParameters()` Sorts query parameters.
`String`	`toString()` Returns the normalized URL as string.
`URI`	`toURI()` Returns the normalized URL as `URI`.
`URL`	`toURL()` Returns the normalized URL as `URL`.
`URLNormalizer`	`unsecureScheme()` Converts `https` scheme to `http`.
`URLNormalizer`	`upperCaseEscapeSequence()` Converts letters in URL-encoded escape sequences to upper case.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - URLNormalizer
```
public URLNormalizer(URL url)
```
    Create a new URLNormalizer instance.
    
    Parameters:
    
    url - the url to normalize
  - URLNormalizer
```
public URLNormalizer(String url)
```
    Create a new URLNormalizer instance.
    Since 1.8.0, spaces in URLs are no longer converted to + automatically. Use encodeNonURICharacters() or encodeSpaces().
    
    Parameters:
    
    url - the url to normalize
- Method Detail
  - lowerCase
```
public URLNormalizer lowerCase()
```
    Converts the entire URL to lower case, including scheme, host name, path, query string parameter names and values. Consider using less aggressive variations of lower case methods to only focus on specific parts of a URL.
    
    HTTP://www.Example.com/Path/Query?Param1=AAA&Param2=BBB → http://www.example.com/path/query?param1=aaa&param2=bbb
    
    Returns:
    
    this instance
    
    Since:
    
    1.15.1
  - lowerCaseSchemeHost
```
public URLNormalizer lowerCaseSchemeHost()
```
    Converts the scheme and host to lower case.
    
    HTTP://www.Example.com/ → http://www.example.com/
    
    Returns:
    
    this instance
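
    As an illustration of the technique only, the following is a minimal JDK-only sketch and not the library's implementation. The class name `SchemeHostLowerCaser` and its regular expression are assumptions made for this example. It lower-cases the scheme and the host (with port) while leaving any user info, path, query and fragment untouched.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex Commons Lang.
class SchemeHostLowerCaser {

    // scheme "://" authority, then everything else (path, query, fragment).
    private static final Pattern PARTS =
            Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*)://([^/?#]*)(.*)$");

    static String lowerCaseSchemeHost(String url) {
        Matcher m = PARTS.matcher(url);
        if (!m.matches()) {
            return url; // not a hierarchical URL: leave it unchanged
        }
        String authority = m.group(2);
        // User info (before '@') is case-sensitive, so it is preserved.
        int at = authority.lastIndexOf('@');
        String userInfo = authority.substring(0, at + 1);
        String hostPort = authority.substring(at + 1).toLowerCase(Locale.ROOT);
        return m.group(1).toLowerCase(Locale.ROOT) + "://"
                + userInfo + hostPort + m.group(3);
    }
}
```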
  - lowerCasePath
```
public URLNormalizer lowerCasePath()
```
    Converts the URL path to lower case.
    
    http://www.example.com/AAA/BBB → http://www.example.com/aaa/bbb
    
    Returns:
    
    this instance
    
    Since:
    
    1.15.1
  - lowerCaseQuery
```
public URLNormalizer lowerCaseQuery()
```
    Converts the URL query string to lower case, which includes both the parameter names and values.
    
    http://www.example.com/query?Param1=AAA&Param2=BBB → http://www.example.com/query?param1=aaa&param2=bbb
    
    Returns:
    
    this instance
    
    Since:
    
    1.15.1
  - lowerCaseQueryParameterNames
```
public URLNormalizer lowerCaseQueryParameterNames()
```
    Converts the URL query parameter names to lower case, leaving query parameter values intact.
    
    http://www.example.com/query?Param1=AAA&Param2=BBB → http://www.example.com/query?param1=AAA&param2=BBB
    
    Returns:
    
    this instance
    
    Since:
    
    1.15.1
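
    The name-only conversion can be sketched with plain JDK string handling. This is an illustrative assumption about how such a step might work, not the library's code. `QueryNameLowerCaser` is a hypothetical name. Each parameter is split at its first "=" and only the part before it is lower-cased.

```java
import java.util.Locale;

// Hypothetical helper, not part of Norconex Commons Lang.
class QueryNameLowerCaser {

    static String lowerCaseQueryParameterNames(String url) {
        int hash = url.indexOf('#');
        int question = url.indexOf('?');
        // No query string, or the '?' belongs to the fragment.
        if (question < 0 || (hash >= 0 && hash < question)) {
            return url;
        }
        int end = hash < 0 ? url.length() : hash;
        String[] params = url.substring(question + 1, end).split("&", -1);
        for (int i = 0; i < params.length; i++) {
            int eq = params[i].indexOf('=');
            int nameEnd = eq < 0 ? params[i].length() : eq;
            // Lower-case the name only; the "=value" part is kept as is.
            params[i] = params[i].substring(0, nameEnd).toLowerCase(Locale.ROOT)
                    + params[i].substring(nameEnd);
        }
        return url.substring(0, question + 1)
                + String.join("&", params) + url.substring(end);
    }
}
```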
  - lowerCaseQueryParameterValues
```
public URLNormalizer lowerCaseQueryParameterValues()
```
    Converts the URL query parameter values to lower case, leaving query parameter names intact.
    
    http://www.example.com/query?Param1=AAA&Param2=BBB → http://www.example.com/query?Param1=aaa&Param2=bbb
    
    Returns:
    
    this instance
    
    Since:
    
    1.15.1
  - upperCaseEscapeSequence
```
public URLNormalizer upperCaseEscapeSequence()
```
    Converts letters in URL-encoded escape sequences to upper case.
    http://www.example.com/a%c2%b1b → http://www.example.com/a%C2%B1b
    
    Returns:
    
    this instance
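
    A minimal JDK-only sketch of this normalization, assuming a simple regular-expression approach (the class name `EscapeUpperCaser` is hypothetical and this is not the library's code): every "%" followed by two hexadecimal digits is upper-cased, and all other characters are left alone.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex Commons Lang.
class EscapeUpperCaser {

    private static final Pattern ESCAPE = Pattern.compile("%[0-9a-fA-F]{2}");

    static String upperCaseEscapeSequence(String url) {
        Matcher m = ESCAPE.matcher(url);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String upper = m.group().toUpperCase(Locale.ROOT);
            m.appendReplacement(out, Matcher.quoteReplacement(upper));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```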
  - decodeUnreservedCharacters
```
public URLNormalizer decodeUnreservedCharacters()
```
    Decodes percent-encoded unreserved characters.
    http://www.example.com/%7Eusername/ → http://www.example.com/~username/
    
    Returns:
    
    this instance
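
    RFC 3986 defines the unreserved characters as letters, digits, "-", ".", "_" and "~". The sketch below is an illustration under that definition, with a hypothetical `UnreservedDecoder` class that is not taken from the library. It decodes only escape sequences that map to one of those characters and keeps every other escape sequence encoded.

```java
// Hypothetical helper, not part of Norconex Commons Lang.
class UnreservedDecoder {

    static String decodeUnreservedCharacters(String url) {
        StringBuilder out = new StringBuilder(url.length());
        int i = 0;
        while (i < url.length()) {
            char c = url.charAt(i);
            if (c == '%' && i + 2 < url.length()) {
                int hi = hexValue(url.charAt(i + 1));
                int lo = hexValue(url.charAt(i + 2));
                if (hi >= 0 && lo >= 0 && isUnreserved(hi * 16 + lo)) {
                    out.append((char) (hi * 16 + lo));
                    i += 3;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1; // not a hexadecimal digit
    }

    // ALPHA / DIGIT / "-" / "." / "_" / "~" (RFC 3986, section 2.3)
    private static boolean isUnreserved(int v) {
        return (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                || (v >= '0' && v <= '9')
                || v == '-' || v == '.' || v == '_' || v == '~';
    }
}
```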
  - encodeNonURICharacters
```
public URLNormalizer encodeNonURICharacters()
```
    Encodes all characters that are not supported characters in a URI (not to be confused with a URL), as defined by the RFC 3986 standard. This includes all non-ASCII characters.
    
    Since this method also encodes spaces to the plus sign (+), there is no need to also invoke encodeSpaces().
    http://www.example.com/^a [b]/ → http://www.example.com/%5Ea+%5Bb%5D/
    
    Returns:
    
    this instance
    
    Since:
    
    1.8.0
  - encodeSpaces
```
public URLNormalizer encodeSpaces()
```
    Encodes space characters into plus signs (+) if they are part of the query string. Spaces part of the URL path are percent-encoded to %20.
    
    To encode all characters not supported in a URI (including spaces and non-ASCII characters), use encodeNonURICharacters() instead.
    http://www.example.com/a b c → http://www.example.com/a+b+c
    
    Returns:
    
    this instance
    
    Since:
    
    1.8.0
  - removeDefaultPort
```
public URLNormalizer removeDefaultPort()
```
    Removes the default port (80 for http, and 443 for https).
    http://www.example.com:80/bar.html → http://www.example.com/bar.html
    
    Returns:
    
    this instance
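
    The rule above can be sketched without the library as follows. `DefaultPortRemover` is a hypothetical name and the regular expression is an assumption of this example. The port is dropped only when it is the default one for the URL's own scheme.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex Commons Lang.
class DefaultPortRemover {

    private static final Pattern PARTS = Pattern.compile(
            "^(https?)://([^/?#]*)(.*)$", Pattern.CASE_INSENSITIVE);

    static String removeDefaultPort(String url) {
        Matcher m = PARTS.matcher(url);
        if (!m.matches()) {
            return url; // only http and https URLs are handled
        }
        String scheme = m.group(1);
        String authority = m.group(2);
        String defaultPort = scheme.equalsIgnoreCase("http") ? ":80" : ":443";
        if (authority.endsWith(defaultPort)) {
            authority = authority.substring(
                    0, authority.length() - defaultPort.length());
        }
        return scheme + "://" + authority + m.group(3);
    }
}
```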
  - addDirectoryTrailingSlash
```
public URLNormalizer addDirectoryTrailingSlash()
```
    Adds a trailing slash (/) to a URL ending with a directory. A URL is considered to end with a directory if the last path segment, before fragment (#) or query string (?), does not contain a dot, typically representing an extension.
    
    Please Note: URLs do not always denote a directory structure and many URLs can qualify for this method without truly representing a directory. Adding a trailing slash to these URLs could potentially break their semantic equivalence.
    http://www.example.com/alice → http://www.example.com/alice/
    
    Returns:
    
    this instance
    
    Since:
    
    1.11.0 (renamed from "addTrailingSlash")
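
    The "no dot in the last path segment" heuristic described above can be sketched as follows. This is a JDK-only illustration with a hypothetical `DirectorySlashAdder` class. Details such as how URLs without a path are treated are assumptions of the example, not documented behavior.

```java
// Hypothetical helper, not part of Norconex Commons Lang.
class DirectorySlashAdder {

    static String addDirectoryTrailingSlash(String url) {
        // The path ends at the first '?' or '#', whichever comes first.
        int pathEnd = url.length();
        for (char c : new char[] {'?', '#'}) {
            int i = url.indexOf(c);
            if (i >= 0 && i < pathEnd) {
                pathEnd = i;
            }
        }
        String head = url.substring(0, pathEnd);
        int schemeEnd = head.indexOf("://");
        int authorityStart = schemeEnd < 0 ? 0 : schemeEnd + 3;
        int slash = head.lastIndexOf('/');
        if (slash < authorityStart) {
            return url; // no path at all (assumption: left unchanged)
        }
        String lastSegment = head.substring(slash + 1);
        if (lastSegment.isEmpty() || lastSegment.contains(".")) {
            return url; // already ends with '/', or looks like a file
        }
        return head + "/" + url.substring(pathEnd);
    }
}
```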
  - addDomainTrailingSlash
```
public URLNormalizer addDomainTrailingSlash()
```
    Adds a trailing slash (/) right after the domain for URLs with no path, before any fragment (#) or query string (?).
    
    Please Note: Adding a trailing slash to URLs could potentially break their semantic equivalence.
    http://www.example.com → http://www.example.com/
    
    Returns:
    
    this instance
    
    Since:
    
    1.12.0
  - addTrailingSlash
```
@Deprecated
public URLNormalizer addTrailingSlash()
```
    Deprecated. Since 1.11.0, use addDirectoryTrailingSlash()
    
    Adds a trailing slash (/) to a URL ending with a directory. A URL is considered to end with a directory if the last path segment, before fragment (#) or query string (?), does not contain a dot, typically representing an extension.
    
    Please Note: URLs do not always denote a directory structure and many URLs can qualify for this method without truly representing a directory. Adding a trailing slash to these URLs could potentially break their semantic equivalence.
    http://www.example.com/alice → http://www.example.com/alice/
    
    Returns:
    
    this instance
  - removeTrailingSlash
```
public URLNormalizer removeTrailingSlash()
```
    Removes any trailing slash (/) from a URL, before fragment (#) or query string (?).
    
    Please Note: Removing trailing slashes from URLs could potentially break their semantic equivalence.
    http://www.example.com/alice/ → http://www.example.com/alice
    
    Returns:
    
    this instance
    
    Since:
    
    1.11.0
  - removeDotSegments
```
public URLNormalizer removeDotSegments()
```
    Removes the unnecessary "." and ".." segments from the URL path.
    
    As of 2.3.0, the algorithm used to remove the dot segments is the one prescribed by RFC 3986.
    http://www.example.com/../a/b/../c/./d.html → http://www.example.com/a/c/d.html
    Please Note: URLs do not always represent a clean hierarchical structure and the dots/double-dots may have a different meaning on some sites. Removing them from a URL could potentially break its semantic equivalence.
    
    Returns:
    
    this instance
    
    See Also:
    
    URI.normalize()
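
    For reference, the "remove_dot_segments" routine of RFC 3986 (section 5.2.4) can be sketched as below. This is a standalone illustration of the RFC's steps applied to a path string, written for this example, and not a copy of the library's code. `DotSegmentRemover` is a hypothetical name.

```java
// Hypothetical helper, not part of Norconex Commons Lang.
class DotSegmentRemover {

    // Applies RFC 3986 section 5.2.4 to a URL path (not a full URL).
    static String removeDotSegments(String path) {
        String in = path;
        StringBuilder out = new StringBuilder();
        while (!in.isEmpty()) {
            if (in.startsWith("../")) {            // step A
                in = in.substring(3);
            } else if (in.startsWith("./")) {      // step A
                in = in.substring(2);
            } else if (in.startsWith("/./")) {     // step B
                in = in.substring(2);
            } else if (in.equals("/.")) {          // step B
                in = "/";
            } else if (in.startsWith("/../")) {    // step C
                in = in.substring(3);
                removeLastSegment(out);
            } else if (in.equals("/..")) {         // step C
                in = "/";
                removeLastSegment(out);
            } else if (in.equals(".") || in.equals("..")) { // step D
                in = "";
            } else {                               // step E
                // Move the first segment (with its leading '/') to the output.
                int next = in.indexOf('/', 1);
                if (next < 0) {
                    next = in.length();
                }
                out.append(in, 0, next);
                in = in.substring(next);
            }
        }
        return out.toString();
    }

    // Drops the last output segment together with its preceding '/'.
    private static void removeLastSegment(StringBuilder out) {
        out.setLength(Math.max(out.lastIndexOf("/"), 0));
    }
}
```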
  - removeDirectoryIndex
```
public URLNormalizer removeDirectoryIndex()
```
    Removes directory index files. They are often not needed in URLs.
    http://www.example.com/a/index.html → http://www.example.com/a/
    Index files must be the last URL path segment to be considered. The following are considered index files:
    - index.html
    - index.htm
    - index.shtml
    - index.php
    - default.html
    - default.htm
    - home.html
    - home.htm
    - index.php5
    - index.php4
    - index.php3
    - index.cgi
    - placeholder.html
    - default.asp
    Please Note: There are no guarantees a URL without its index files will be semantically equivalent, or even be valid.
    Returns:
    
    this instance
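
    A minimal sketch of the rule "the index file must be the last path segment", using the file names listed above. The class name `DirectoryIndexRemover` and the case-sensitive match are assumptions of this example and not the library's implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Norconex Commons Lang.
class DirectoryIndexRemover {

    private static final Set<String> INDEX_FILES = new HashSet<>(Arrays.asList(
            "index.html", "index.htm", "index.shtml", "index.php",
            "default.html", "default.htm", "home.html", "home.htm",
            "index.php5", "index.php4", "index.php3", "index.cgi",
            "placeholder.html", "default.asp"));

    static String removeDirectoryIndex(String url) {
        // The path ends at the first '?' or '#', whichever comes first.
        int pathEnd = url.length();
        for (char c : new char[] {'?', '#'}) {
            int i = url.indexOf(c);
            if (i >= 0 && i < pathEnd) {
                pathEnd = i;
            }
        }
        String head = url.substring(0, pathEnd);
        int schemeEnd = head.indexOf("://");
        int authorityStart = schemeEnd < 0 ? 0 : schemeEnd + 3;
        int slash = head.lastIndexOf('/');
        if (slash < authorityStart) {
            return url; // no path, nothing to remove
        }
        // Assumption: names are matched exactly (case-sensitive).
        if (!INDEX_FILES.contains(head.substring(slash + 1))) {
            return url;
        }
        return head.substring(0, slash + 1) + url.substring(pathEnd);
    }
}
```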
  - removeFragment
```
public URLNormalizer removeFragment()
```
    Removes the URL fragment (from the "#" character until the end).
    http://www.example.com/bar.html#section1 → http://www.example.com/bar.html
    
    Returns:
    
    this instance
  - removeQueryString
```
public URLNormalizer removeQueryString()
```
    Removes the URL query string (from the "?" character until the end or the first # character).
    http://www.example.com/query?param1=AAA7&param2=BBB#fragment → http://www.example.com/query#fragment
    
    Returns:
    
    this instance
    
    Since:
    
    1.15.1
  - replaceIPWithDomainName
```
public URLNormalizer replaceIPWithDomainName()
```
    Replaces IP address with domain name. This is often not reliable due to virtual domain names and can be slow, as it has to access the network.
    http://208.77.188.166/ → http://www.example.com/
    
    Returns:
    
    this instance
  - unsecureScheme
```
public URLNormalizer unsecureScheme()
```
    Converts https scheme to http.
    https://www.example.com/ → http://www.example.com/
    
    Returns:
    
    this instance
  - secureScheme
```
public URLNormalizer secureScheme()
```
    Converts http scheme to https.
    http://www.example.com/ → https://www.example.com/
    
    Returns:
    
    this instance
  - removeDuplicateSlashes
```
public URLNormalizer removeDuplicateSlashes()
```
    Removes duplicate slashes. Two or more adjacent slash ("/") characters will be converted into one.
    http://www.example.com/foo//bar.html → http://www.example.com/foo/bar.html
    
    Returns:
    
    this instance
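
    One detail worth illustrating is that the "//" following the scheme must survive. The sketch below is a JDK-only example with a hypothetical `DuplicateSlashRemover` class. Restricting the change to the part before the query string or fragment is an assumption of this example. It collapses runs of slashes only after the scheme separator.

```java
// Hypothetical helper, not part of Norconex Commons Lang.
class DuplicateSlashRemover {

    static String removeDuplicateSlashes(String url) {
        // Assumption: the query string and fragment are left untouched.
        int pathEnd = url.length();
        for (char c : new char[] {'?', '#'}) {
            int i = url.indexOf(c);
            if (i >= 0 && i < pathEnd) {
                pathEnd = i;
            }
        }
        String head = url.substring(0, pathEnd);
        // Skip past "scheme://" so its double slash is preserved.
        int schemeEnd = head.indexOf("://");
        int start = schemeEnd < 0 ? 0 : schemeEnd + 3;
        String collapsed = head.substring(start).replaceAll("/{2,}", "/");
        return head.substring(0, start) + collapsed + url.substring(pathEnd);
    }
}
```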
  - removeWWW
```
public URLNormalizer removeWWW()
```
    Removes "www." domain name prefix.
    http://www.example.com/ → http://example.com/
    
    Returns:
    
    this instance
  - addWWW
```
public URLNormalizer addWWW()
```
    Adds "www." domain name prefix.
    http://example.com/ → http://www.example.com/
    
    Returns:
    
    this instance
  - sortQueryParameters
```
public URLNormalizer sortQueryParameters()
```
    Sorts query parameters.
    http://www.example.com/?z=bb&y=cc&z=aa → http://www.example.com/?y=cc&z=bb&z=aa
    
    Returns:
    
    this instance
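
    The example above keeps `z=bb` before `z=aa`, which is what a stable sort on parameter names alone would produce. The following sketch reproduces that ordering with plain JDK calls. Whether the library orders equal names this way in every case is an assumption, and `QueryParameterSorter` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not part of Norconex Commons Lang.
class QueryParameterSorter {

    static String sortQueryParameters(String url) {
        int hash = url.indexOf('#');
        int question = url.indexOf('?');
        // No query string, or the '?' belongs to the fragment.
        if (question < 0 || (hash >= 0 && hash < question)) {
            return url;
        }
        int end = hash < 0 ? url.length() : hash;
        String[] params = url.substring(question + 1, end).split("&");
        // Arrays.sort on objects is stable: equal names keep their order.
        Arrays.sort(params, Comparator.comparing(QueryParameterSorter::name));
        return url.substring(0, question + 1)
                + String.join("&", params) + url.substring(end);
    }

    // The parameter name is everything before the first '='.
    private static String name(String param) {
        int eq = param.indexOf('=');
        return eq < 0 ? param : param.substring(0, eq);
    }
}
```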
  - removeEmptyParameters
```
public URLNormalizer removeEmptyParameters()
```
    Removes empty parameters.
    http://www.example.com/display?a=b&a=&c=d&e=&f=g → http://www.example.com/display?a=b&c=d&f=g
    
    Returns:
    
    this instance
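
    A possible reading of "empty" is a parameter with no name or no value. The sketch below applies that interpretation, which is an assumption of this example rather than the library's documented rule. `EmptyParameterRemover` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Norconex Commons Lang.
class EmptyParameterRemover {

    static String removeEmptyParameters(String url) {
        int hash = url.indexOf('#');
        int question = url.indexOf('?');
        // No query string, or the '?' belongs to the fragment.
        if (question < 0 || (hash >= 0 && hash < question)) {
            return url;
        }
        int end = hash < 0 ? url.length() : hash;
        List<String> kept = new ArrayList<>();
        for (String param : url.substring(question + 1, end).split("&")) {
            int eq = param.indexOf('=');
            // Keep only "name=value" pairs where both sides are non-empty.
            if (eq > 0 && eq < param.length() - 1) {
                kept.add(param);
            }
        }
        return url.substring(0, question + 1)
                + String.join("&", kept) + url.substring(end);
    }
}
```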
  - removeTrailingQuestionMark
```
public URLNormalizer removeTrailingQuestionMark()
```
    Removes trailing question mark ("?").
    http://www.example.com/display? → http://www.example.com/display
    
    Returns:
    
    this instance
  - removeSessionIds
```
public URLNormalizer removeSessionIds()
```
    Removes a URL-based session id. It removes PHP (PHPSESSID), ASP (ASPSESSIONID), and Java EE (jsessionid) session ids.
    http://www.example.com/servlet;jsessionid=1E6FEC0D14D044541DD84D2D013D29ED?a=b → http://www.example.com/servlet?a=b
    Please Note: Removing session IDs from URLs is often a good way to have the URL return an error once invoked.
    
    Returns:
    
    this instance
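
    The Java EE case shown in the example can be sketched with a single regular expression. This illustration covers only the `;jsessionid=` path-parameter form. PHP and ASP session ids, which the method also removes, are deliberately left out. `SessionIdRemover` is a hypothetical name and the pattern is an assumption of this example.

```java
// Hypothetical helper, not part of Norconex Commons Lang.
class SessionIdRemover {

    static String removeJSessionId(String url) {
        // Drop ";jsessionid=<value>" up to the next '?', '#', '&' or ';'.
        return url.replaceAll("(?i);jsessionid=[^?#&;]*", "");
    }
}
```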
  - removeTrailingHash
```
public URLNormalizer removeTrailingHash()
```
    Removes trailing hash character ("#").
    http://www.example.com/path# → http://www.example.com/path
    This only removes the hash character if it is the last character. To remove an entire URL fragment, use removeFragment().
    
    Returns:
    
    this instance
    
    Since:
    
    1.13.0
  - toString
```
public String toString()
```
    Returns the normalized URL as string.
    
    Overrides:
    
    toString in class Object
    
    Returns:
    
    URL
  - toURI
```
public URI toURI()
```
    Returns the normalized URL as URI.
    
    Returns:
    
    URI
  - toURL
```
public URL toURL()
```
    Returns the normalized URL as URL.
    
    Returns:
    
    URL
