Class GenericURLNormalizer

java.lang.Object
com.norconex.collector.http.url.impl.GenericURLNormalizer
All Implemented Interfaces:
IURLNormalizer, IXMLConfigurable

public class GenericURLNormalizer extends Object implements IURLNormalizer, IXMLConfigurable

Generic implementation of IURLNormalizer that should satisfy most URL normalization needs. This implementation relies on URLNormalizer. Please refer to it for complete documentation and examples.

This class is in effect by default. To skip its usage, you can explicitly set the URL Normalizer to null in the HttpCrawlerConfig, or you can disable it using setDisabled(boolean).

By default, this class removes the URL fragment and applies these RFC 3986 normalizations:

  • Converting the scheme and host to lower case
  • Capitalizing letters in escape sequences
  • Decoding percent-encoded unreserved characters
  • Removing the default port
  • Encoding non-URI characters
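
As a rough illustration of what some of these defaults do, the following is a minimal plain-Java sketch, not the actual URLNormalizer implementation. It covers only fragment removal, lower-casing of the scheme and host, and default port removal. The class name DefaultNormalizationSketch and its normalize method are hypothetical, and the sketch assumes an absolute http/https URL with a host and no user info.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration only; GenericURLNormalizer delegates to URLNormalizer instead.
public class DefaultNormalizationSketch {

    public static String normalize(String url) {
        URI uri = URI.create(url);

        // Convert the scheme and host to lower case.
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // Remove the port when it is the default one for the scheme.
        int port = uri.getPort();
        boolean defaultPort = (port == 80 && scheme.equals("http"))
                || (port == 443 && scheme.equals("https"));

        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(host);
        if (port != -1 && !defaultPort) {
            sb.append(':').append(port);
        }
        // Path and query are copied as-is (no escape-sequence handling here).
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        // The fragment is intentionally not appended, which removes it.
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "http://example.com/Path?a=b"
        System.out.println(normalize("HTTP://Example.COM:80/Path?a=b#frag"));
    }
}
```

The path keeps its original case because only the scheme and host are case-insensitive.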

To override this default, you have to specify a new list of normalizations to apply, via the setNormalizations(Normalization...) method, or via XML configuration. Each normalization is identified by a code name. The following is the complete code name list for supported normalizations. Click on any code name to get a full description from URLNormalizer:

In addition, this class allows you to specify any number of URL value replacements using regular expressions.

XML configuration usage:


<urlNormalizer
    class="com.norconex.collector.http.url.impl.GenericURLNormalizer"
    disabled="[false|true]">
  <normalizations>(normalization code names, comma-separated)</normalizations>
  <replacements>
    <replace>
      <match>(regex pattern to match)</match>
      <replacement>(optional replacement value, defaults to blank)</replacement>
    </replace>
    (... repeat replace tag as needed ...)
  </replacements>
</urlNormalizer>

Since 2.7.2, having an empty "normalizations" tag will effectively remove any normalization rules previously set (including the default ones). Not having the tag at all will keep existing/default normalizations.

XML usage example:


<urlNormalizer
    class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
    decodeUnreservedCharacters, removeDefaultPort,
    encodeNonURICharacters, addWWW
  </normalizations>
  <replacements>
    <replace>
      <match>&amp;view=print</match>
    </replace>
    <replace>
      <match>(&amp;type=)(summary)</match>
      <replacement>$1full</replacement>
    </replace>
  </replacements>
</urlNormalizer>

The example above adds a normalization that prepends "www." to URL domains when missing, on top of the default set of normalizations. It also adds custom URL "search-and-replace" rules that remove any "&view=print" strings from URLs and replace "&type=summary" with "&type=full".
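
To make the effect of the two replacements concrete, here is a minimal plain-Java sketch. It assumes the replacements are applied in order with standard java.util.regex semantics, which is an assumption about the behavior rather than a copy of this class's code. The class name ReplacementSketch, the applyReplacements helper, and the sample URL are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the "match"/"replacement" pairs from the XML example.
public class ReplacementSketch {

    // Applies each regex replacement in insertion order.
    public static String applyReplacements(
            String url, Map<String, String> replacements) {
        String result = url;
        for (Map.Entry<String, String> entry : replacements.entrySet()) {
            result = result.replaceAll(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> replacements = new LinkedHashMap<>();
        // No replacement value given in the XML: the match is replaced with blank.
        replacements.put("&view=print", "");
        // "$1" refers to the first capture group, "&type=".
        replacements.put("(&type=)(summary)", "$1full");

        String url = "http://www.example.com/doc?id=1&type=summary&view=print";
        // prints "http://www.example.com/doc?id=1&type=full"
        System.out.println(applyReplacements(url, replacements));
    }
}
```

Note that in the XML form the ampersand must be written as an entity, while in Java source the pattern uses a literal "&".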

Author:
Pascal Essiembre