Package com.norconex.collector.http.url
Interface IURLNormalizer
- All Known Implementing Classes:
GenericURLNormalizer
public interface IURLNormalizer
Responsible for normalizing URLs. Normalization takes a raw URL and
modifies it to its most basic or standard form. In other words, it makes
different URLs "equivalent". This makes it possible to eliminate URL variations
that point to the same content (e.g. URLs carrying temporary session
information). This action takes place right after URLs are extracted
from a document, before any of these URLs is considered
for further processing. Returning null effectively tells the crawler
not to consider the URL for processing at all (it won't go through the regular
document processing flow). You may want to consider using IReferenceFilter
to exclude URLs as part of the regular document processing flow instead
(this may create a trace in the logs and gives you more options).
Implementors also implementing IXMLConfigurable must name their XML tag
urlNormalizer to ensure it gets loaded properly.
- Author:
- Pascal Essiembre
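
The contract described above can be illustrated with a minimal sketch. It assumes a hypothetical stand-in interface, UrlNormalizer, that mirrors the single-URL method of IURLNormalizer so the example compiles without the library. The rules applied here (dropping a jsessionid path parameter, lower-casing the scheme and host, removing the fragment, ignoring any user-info part, and rejecting non-HTTP URLs with null) are illustrative choices only and are not the behavior of GenericURLNormalizer.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class UrlNormalizerSketch {

    /** Hypothetical stand-in mirroring the single-URL method of IURLNormalizer. */
    interface UrlNormalizer {
        String normalizeURL(String url);
    }

    /** Illustrative rules only; not what any shipped normalizer does. */
    static final class SimpleNormalizer implements UrlNormalizer {
        @Override
        public String normalizeURL(String url) {
            if (url == null) {
                return null;
            }
            // Drop a ";jsessionid=..." path parameter (temporary session information).
            String cleaned = url.trim().replaceAll("(?i);jsessionid=[^?#]*", "");
            try {
                URI uri = new URI(cleaned);
                String scheme = uri.getScheme();
                String host = uri.getHost();
                if (scheme == null || host == null) {
                    return null; // not an absolute URL with a host: reject it
                }
                scheme = scheme.toLowerCase(Locale.ROOT);
                if (!scheme.equals("http") && !scheme.equals("https")) {
                    return null; // assumption: only HTTP(S) URLs are of interest
                }
                // Rebuild with lower-cased scheme and host, and without the fragment.
                StringBuilder sb = new StringBuilder();
                sb.append(scheme).append("://").append(host.toLowerCase(Locale.ROOT));
                if (uri.getPort() != -1) {
                    sb.append(':').append(uri.getPort());
                }
                if (uri.getRawPath() != null) {
                    sb.append(uri.getRawPath());
                }
                if (uri.getRawQuery() != null) {
                    sb.append('?').append(uri.getRawQuery());
                }
                return sb.toString();
            } catch (URISyntaxException e) {
                return null; // unparsable input: signal the caller to skip it
            }
        }
    }

    public static void main(String[] args) {
        UrlNormalizer n = new SimpleNormalizer();
        // prints "http://example.com/page?x=1"
        System.out.println(n.normalizeURL("HTTP://Example.COM/page;jsessionid=ABC123?x=1#top"));
        // prints "null"
        System.out.println(n.normalizeURL("mailto:someone@example.com"));
    }
}
```

Returning null for rejected input follows the description above: a null result means the URL is dropped before the regular document processing flow.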
Method Summary
Modifier and Type, Method, Description:
normalizeURL(String url): Normalize the given URL.
static String normalizeURL(String url, List<IURLNormalizer> normalizers): Normalizes a URL by applying each normalizer in the list.
Method Details
normalizeURL
Normalize the given URL.
- Parameters:
url - the URL to normalize
- Returns:
- the normalized URL
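
As a rough illustration of how a caller might use this method, the sketch below normalizes a batch of extracted URLs, skips those for which the normalizer returns null, and collects the rest in a set so that equivalent URLs collapse into one entry. The UrlNormalizer interface, the normalizeAll helper, and the lambda rule are all hypothetical names introduced for this example; the crawler's actual call site is not shown in this page.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class NormalizeCallSiteSketch {

    /** Hypothetical stand-in mirroring the single-URL method of IURLNormalizer. */
    interface UrlNormalizer {
        String normalizeURL(String url);
    }

    /** Hypothetical helper: normalizes each URL, drops nulls, removes duplicates. */
    static Set<String> normalizeAll(List<String> extracted, UrlNormalizer normalizer) {
        Set<String> unique = new LinkedHashSet<>();
        for (String raw : extracted) {
            String normalized = normalizer.normalizeURL(raw);
            if (normalized != null) { // null means "do not process this URL"
                unique.add(normalized);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Illustrative rule: strip jsessionid from http(s) URLs, reject anything else.
        UrlNormalizer n = url -> url.startsWith("http")
                ? url.replaceAll("(?i);jsessionid=[^?#]*", "")
                : null;
        List<String> extracted = Arrays.asList(
                "http://example.com/a;jsessionid=111",
                "http://example.com/a;jsessionid=222",
                "mailto:x@example.com");
        // prints "[http://example.com/a]"
        System.out.println(normalizeAll(extracted, n));
    }
}
```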
normalizeURL
Normalizes a URL by applying each normalizer in the list.
- Parameters:
url - the URL to normalize
normalizers - the normalizers
- Returns:
- the normalized URL
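
One way such a list-based method could work is sketched below, using a hypothetical stand-in interface (UrlNormalizer) with a static method of the same shape. Two details are assumptions rather than documented behavior: the normalizers are applied in list order, each receiving the previous result, and the chain stops as soon as one of them returns null.

```java
import java.util.Arrays;
import java.util.List;

final class NormalizerChainSketch {

    /** Hypothetical stand-in mirroring the two methods listed for IURLNormalizer. */
    interface UrlNormalizer {
        String normalizeURL(String url);

        /** Applies each normalizer in order, feeding it the previous result. */
        static String normalizeURL(String url, List<UrlNormalizer> normalizers) {
            String result = url;
            if (normalizers == null) {
                return result;
            }
            for (UrlNormalizer normalizer : normalizers) {
                if (result == null) {
                    break; // assumption: a null result rejects the URL, so stop here
                }
                result = normalizer.normalizeURL(result);
            }
            return result;
        }
    }

    public static void main(String[] args) {
        UrlNormalizer trim = String::trim;
        UrlNormalizer dropFragment = u -> u.replaceAll("#.*$", "");
        UrlNormalizer dropTrailingSlash =
                u -> u.endsWith("/") ? u.substring(0, u.length() - 1) : u;

        List<UrlNormalizer> chain = Arrays.asList(trim, dropFragment, dropTrailingSlash);
        // prints "http://example.com/docs"
        System.out.println(UrlNormalizer.normalizeURL("  http://example.com/docs/#intro ", chain));
    }
}
```

Because each step sees the output of the previous one, the order of the list can change the result, so the order is worth choosing deliberately.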