Interface IURLNormalizer

  • All Known Implementing Classes:
    GenericURLNormalizer

    public interface IURLNormalizer
    Responsible for normalizing URLs. Normalization takes a raw URL and reduces it to its most basic or standard form, so that different URLs become "equivalent". This eliminates URL variations that point to the same content (e.g., URLs carrying temporary session information).

    This action takes place right after URLs are extracted from a document, before any of these URLs is considered for further processing. Returning null effectively tells the crawler not to consider the URL at all (it will not go through the regular document processing flow). To exclude URLs as part of the regular document processing flow instead, consider IReferenceFilter, which may leave a trace in the logs and gives you more options.

    Implementors that also implement IXMLConfigurable must name their XML tag urlNormalizer to ensure it gets loaded properly.
    Author:
    Pascal Essiembre
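
As an illustration of the contract described above, here is a minimal sketch of a custom normalizer. It is not taken from the library: the nested IURLNormalizer is a local stand-in that mirrors the single documented instance method so the example compiles on its own (a real implementation would implement the library's interface instead), and SessionIdNormalizer, its regular expression, and the rule of rejecting non-HTTP(S) URLs are all hypothetical choices made for the example.

```java
import java.util.Locale;

class NormalizerSketch {

    // Local stand-in for the documented interface (assumption: declared
    // here only so the sketch is self-contained).
    interface IURLNormalizer {
        String normalizeURL(String url);
    }

    // Hypothetical normalizer: strips ";jsessionid=..." path parameters
    // and returns null for anything that is not an HTTP(S) URL.
    static class SessionIdNormalizer implements IURLNormalizer {
        @Override
        public String normalizeURL(String url) {
            if (url == null) {
                return null;
            }
            String lower = url.toLowerCase(Locale.ROOT);
            if (!lower.startsWith("http://") && !lower.startsWith("https://")) {
                // Returning null asks the crawler to skip this URL entirely.
                return null;
            }
            // Remove the session segment up to the next '?' or '#'
            // (case-insensitive match on "jsessionid").
            return url.replaceAll("(?i);jsessionid=[^?#]*", "");
        }
    }

    public static void main(String[] args) {
        IURLNormalizer normalizer = new SessionIdNormalizer();
        // prints "http://example.com/page?x=1"
        System.out.println(normalizer.normalizeURL(
                "http://example.com/page;jsessionid=ABC123?x=1"));
        // prints "null"
        System.out.println(normalizer.normalizeURL("mailto:someone@example.com"));
    }
}
```

Returning null here drops the URL before any further processing, so a rule that should leave a trace in the logs is probably better expressed as an IReferenceFilter.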
    • Method Detail

      • normalizeURL

        String normalizeURL(String url)
        Normalizes the given URL.
        Parameters:
        url - the URL to normalize
        Returns:
        the normalized URL, or null to have the crawler skip the URL
      • normalizeURL

        static String normalizeURL(String url,
                                   List<IURLNormalizer> normalizers)
        Normalizes a URL by applying each normalizer in the list.
        Parameters:
        url - the URL to normalize
        normalizers - the normalizers
        Returns:
        the normalized URL
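
The chaining described for the static method can be sketched roughly as follows. This is an assumption-laden illustration, not the library's implementation: applyAll is a hypothetical helper, the nested IURLNormalizer is a local stand-in for the documented interface, and both the list-order application and the early stop on a null result are guesses about reasonable behavior rather than documented facts.

```java
import java.util.Arrays;
import java.util.List;

class ChainSketch {

    // Local stand-in for the documented interface (assumption).
    interface IURLNormalizer {
        String normalizeURL(String url);
    }

    // Hypothetical helper: feeds the output of each normalizer into the
    // next one, in list order, and stops once a normalizer returns null.
    static String applyAll(String url, List<IURLNormalizer> normalizers) {
        String result = url;
        if (normalizers == null) {
            return result;
        }
        for (IURLNormalizer normalizer : normalizers) {
            if (result == null) {
                break; // a previous normalizer rejected the URL
            }
            result = normalizer.normalizeURL(result);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two illustrative normalizers written as lambdas.
        IURLNormalizer stripFragment = u -> {
            int i = u.indexOf('#');
            return i < 0 ? u : u.substring(0, i);
        };
        IURLNormalizer dropTrailingSlash =
                u -> u.endsWith("/") ? u.substring(0, u.length() - 1) : u;

        List<IURLNormalizer> chain = Arrays.asList(stripFragment, dropTrailingSlash);
        // prints "http://example.com/docs"
        System.out.println(applyAll("http://example.com/docs/#intro", chain));
    }
}
```

With this kind of chaining, order matters: each normalizer sees the output of the previous one, so swapping the two lambdas above would leave the trailing slash in place for this input.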