Package com.norconex.collector.http.url
Interface IURLNormalizer
- All Known Implementing Classes:
GenericURLNormalizer
public interface IURLNormalizer
Responsible for normalizing URLs. Normalization takes a raw URL and reduces it to its most basic or standard form, so that different URLs pointing to the same content are treated as equivalent. This eliminates URL variations such as those carrying temporary session information. Normalization takes place right after URLs are extracted from a document, before any of these URLs is considered for further processing. Returning null tells the crawler not to consider the URL for processing at all (it will not go through the regular document processing flow). You may want to consider IReferenceFilter to exclude URLs as part of the regular document processing flow (this may create a trace in the logs and gives you more options). Implementors also implementing IXMLConfigurable must name their XML tag urlNormalizer to ensure it gets loaded properly.
- Author:
- Pascal Essiembre
Method Summary
Modifier and Type | Method | Description
String | normalizeURL(String url) | Normalize the given URL.
static String | normalizeURL(String url, List<IURLNormalizer> normalizers) | Normalizes a URL by applying each normalizer in the list.
Method Detail
-
normalizeURL
String normalizeURL(String url)
Normalize the given URL.
- Parameters:
- url - the URL to normalize
- Returns:
- the normalized URL
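As an illustration of the contract described for this interface, the following is a minimal sketch of a custom implementation. The `IURLNormalizer` declared here is only a local stand-in so the example compiles on its own, and `SessionIdStrippingNormalizer` is a hypothetical class, not part of the library. It removes a `;jsessionid=...` path parameter and returns null for non-HTTP URLs, so that the crawler would not consider them.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Local stand-in for com.norconex.collector.http.url.IURLNormalizer,
// reproduced only so this sketch is self-contained (assumption).
interface IURLNormalizer {
    String normalizeURL(String url);
}

// Hypothetical normalizer: strips ";jsessionid=..." and rejects non-HTTP URLs.
final class SessionIdStrippingNormalizer implements IURLNormalizer {

    // Matches the session path parameter up to the query string or fragment.
    private static final Pattern JSESSIONID =
            Pattern.compile(";jsessionid=[^?#]*", Pattern.CASE_INSENSITIVE);

    @Override
    public String normalizeURL(String url) {
        if (url == null) {
            return null;
        }
        String lower = url.toLowerCase(Locale.ROOT);
        if (!lower.startsWith("http://") && !lower.startsWith("https://")) {
            // Returning null signals that the URL should not be processed.
            return null;
        }
        return JSESSIONID.matcher(url).replaceAll("");
    }
}
```

With this sketch, two URLs that differ only by their session identifier normalize to the same string, which is the kind of equivalence the interface is meant to provide.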
-
normalizeURL
static String normalizeURL(String url, List<IURLNormalizer> normalizers)
Normalizes a URL by applying each normalizer in the list.
- Parameters:
- url - the URL to normalize
- normalizers - the normalizers
- Returns:
- the normalized URL
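The sketch below shows one way that applying a list of normalizers could work. It is not the library's implementation: `NormalizerChain` and its methods are hypothetical, the `IURLNormalizer` declared here is a local stand-in, and the sketch assumes that normalizers are applied in list order and that a null result stops the chain.

```java
import java.util.Arrays;
import java.util.List;

// Local stand-in for com.norconex.collector.http.url.IURLNormalizer,
// reproduced only so this sketch is self-contained (assumption).
interface IURLNormalizer {
    String normalizeURL(String url);
}

// Hypothetical helper illustrating sequential application of normalizers.
final class NormalizerChain {

    private NormalizerChain() {
    }

    // Feeds the output of each normalizer into the next one.
    // Assumption: a null result means "rejected", so the chain stops there.
    static String apply(String url, List<IURLNormalizer> normalizers) {
        String result = url;
        if (normalizers == null) {
            return result;
        }
        for (IURLNormalizer normalizer : normalizers) {
            if (result == null) {
                break;
            }
            result = normalizer.normalizeURL(result);
        }
        return result;
    }

    // Two small example normalizers: drop the fragment, then a trailing slash.
    static List<IURLNormalizer> exampleNormalizers() {
        IURLNormalizer dropFragment = u -> {
            int hash = u.indexOf('#');
            return hash < 0 ? u : u.substring(0, hash);
        };
        IURLNormalizer dropTrailingSlash =
                u -> u.endsWith("/") ? u.substring(0, u.length() - 1) : u;
        return Arrays.asList(dropFragment, dropTrailingSlash);
    }
}
```

In real code you would call the interface's own static `normalizeURL(String url, List<IURLNormalizer> normalizers)` rather than a helper like this one; the sketch only makes the chaining idea concrete.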