public interface ISitemapResolver
Given a URL root, resolve the corresponding sitemap(s), if any, and only if it has not yet been resolved for a crawling session.
Sitemap URLs can be specified as "start URLs" (defined in HttpCrawlerConfig.getStartSitemapURLs()). It is up to the selected implementation to decide whether to treat sitemaps specified as start URLs any differently.
When sitemaps are ignored (HttpCrawlerConfig.isIgnoreSitemap()), the selected sitemap resolver implementation will still be invoked for sitemaps specified as start URLs.
Sitemap locations to resolve may also come from a site's robots.txt (provided robots.txt files are not ignored).
It is possible for implementations not to attempt to resolve sitemaps for some URLs. Refer to the specific implementation for more details.
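The "resolve only once per URL root in a crawling session" contract described above can be illustrated with a minimal sketch. It does not use the real Norconex classes: `FetchClient` and `DocInfo` are hypothetical stand-ins for `HttpFetchClient` and `HttpDocInfo`, `SitemapResolver` only mirrors the shape of `resolveSitemaps(...)`, and `OncePerRootResolver` is an invented name. The sketch also assumes `sitemapLocations` is non-null and chooses to ignore the `startURLs` flag, which the description above leaves to each implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

public class SitemapResolverSketch {

    // Hypothetical stand-in for HttpFetchClient: returns the URLs listed
    // in the sitemap found at the given location.
    public interface FetchClient {
        List<String> fetchSitemapUrls(String sitemapLocation);
    }

    // Hypothetical stand-in for HttpDocInfo: carries only a URL reference.
    public static final class DocInfo {
        public final String reference;
        public DocInfo(String reference) { this.reference = reference; }
    }

    // Mirrors the shape of ISitemapResolver.resolveSitemaps(...) with stand-in types.
    public interface SitemapResolver {
        void resolveSitemaps(FetchClient fetcher, String urlRoot,
                List<String> sitemapLocations,
                Consumer<DocInfo> sitemapURLConsumer, boolean startURLs);
    }

    // Resolves the sitemaps of each URL root at most once per instance
    // (one instance is assumed to live for one crawling session).
    public static final class OncePerRootResolver implements SitemapResolver {
        private final Set<String> resolvedRoots = ConcurrentHashMap.newKeySet();

        @Override
        public void resolveSitemaps(FetchClient fetcher, String urlRoot,
                List<String> sitemapLocations,
                Consumer<DocInfo> sitemapURLConsumer, boolean startURLs) {
            // Set.add returns false when the root was already recorded,
            // in which case there is nothing left to do.
            if (!resolvedRoots.add(urlRoot)) {
                return;
            }
            // The startURLs flag is deliberately ignored in this sketch.
            for (String location : sitemapLocations) {
                for (String url : fetcher.fetchSitemapUrls(location)) {
                    sitemapURLConsumer.accept(new DocInfo(url));
                }
            }
        }
    }

    public static void main(String[] args) {
        // Fake fetcher: every sitemap lists the same two pages.
        FetchClient fake = location ->
                List.of("https://example.com/a", "https://example.com/b");
        List<String> collected = new ArrayList<>();
        SitemapResolver resolver = new OncePerRootResolver();
        List<String> locations = List.of("https://example.com/sitemap.xml");

        resolver.resolveSitemaps(fake, "https://example.com", locations,
                doc -> collected.add(doc.reference), false);
        // Second call for the same root is skipped.
        resolver.resolveSitemaps(fake, "https://example.com", locations,
                doc -> collected.add(doc.reference), false);

        System.out.println(collected.size()); // prints 2
    }
}
```

Tracking resolved roots in a concurrent set keeps the check-and-mark step atomic, which matters if several crawler threads could hit the same root at once. A real implementation would fetch and parse sitemap files rather than rely on a fake fetcher.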
Modifier and Type | Method and Description |
---|---|
void | resolveSitemaps(HttpFetchClient httpFetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs) Resolves the sitemap instructions for a URL "root" (e.g. |
void resolveSitemaps(HttpFetchClient httpFetcher, String urlRoot, List<String> sitemapLocations, Consumer<HttpDocInfo> sitemapURLConsumer, boolean startURLs)
httpFetcher - the HTTP fetcher executor to use to stream Internet files if needed
urlRoot - the URL root for which to resolve the sitemap
sitemapLocations - sitemap locations to resolve
sitemapURLConsumer - where to store retrieved sitemap URLs
startURLs - whether the sitemapLocations provided (if any) are start URLs (defined in HttpCrawlerConfig.getStartSitemapURLs())
Copyright © 2009–2023 Norconex Inc. All rights reserved.