public class URLStatusCrawlerEventListener extends Object implements ICrawlerEventListener, IXMLConfigurable
Stores to file all URLs that were "fetched", along with their HTTP response codes, usually for reporting purposes (e.g. finding broken links). A short summary of all HTTP status codes can be found here.
By default, this listener stores the status of all fetched URLs,
regardless of what those statuses are. This can generate very lengthy reports
on large crawls. If you are only interested in certain status codes, you can
listen only for those using the setStatusCodes(String) method
or its XML configuration equivalent. Specify the codes you want to listen
for as comma-separated values. Ranges are also supported: specify two range
values (both inclusive) separated by a hyphen. For instance, to store
all "bad" URLs, you can quickly specify all codes except
200 (OK) this way:
100-199,201-599
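The comma-and-range syntax described above can be illustrated with a small standalone parser. This is only a sketch of how such a specification might be interpreted, not the listener's actual implementation: the class name StatusCodeSpec and its matches method are hypothetical, and treating an empty string the same as null (listen for everything) is an assumption.

```java
import java.util.BitSet;

/** Hypothetical helper: interprets a spec such as "100-199,201-599". */
public class StatusCodeSpec {

    private final BitSet codes = new BitSet(600);
    private final boolean matchAll;

    public StatusCodeSpec(String spec) {
        // null means "all status codes"; treating blank the same way is an assumption.
        matchAll = spec == null || spec.trim().isEmpty();
        if (matchAll) {
            return;
        }
        for (String token : spec.split(",")) {
            String t = token.trim();
            if (t.isEmpty()) {
                continue;
            }
            int dash = t.indexOf('-');
            if (dash < 0) {
                // Single code, e.g. "404".
                codes.set(Integer.parseInt(t));
            } else {
                // Range with both ends inclusive, e.g. "201-599".
                int from = Integer.parseInt(t.substring(0, dash).trim());
                int to = Integer.parseInt(t.substring(dash + 1).trim());
                codes.set(from, to + 1); // BitSet's upper bound is exclusive
            }
        }
    }

    /** Returns true if the given HTTP status code is covered by the spec. */
    public boolean matches(int statusCode) {
        return matchAll || codes.get(statusCode);
    }

    public static void main(String[] args) {
        StatusCodeSpec bad = new StatusCodeSpec("100-199,201-599");
        System.out.println(bad.matches(200));
        System.out.println(bad.matches(404));
    }
}
```

With the specification shown above, matches(200) is false and every other code from 100 to 599 matches.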
The generated report will be stored in the directory specified
using setOutputDir(String). By default, the generated file
uses this naming pattern:
urlstatuses-[crawlerId]-[timestamp].tsv
The filename prefix can be changed from "urlstatuses-" to anything else
using setFileNamePrefix(String).
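To make the naming pattern concrete, here is a minimal sketch that assembles a report path from an output directory, prefix, crawler ID and timestamp. It is illustrative only: the class ReportFileName and its resolve method are hypothetical, the prefix is assumed to be concatenated as-is (so a custom prefix would need its own trailing separator), and the timestamp is assumed to be a plain number because its real format is not documented here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: builds a path following prefix[crawlerId]-[timestamp].tsv. */
public class ReportFileName {

    // Default prefix as quoted in the documentation above.
    static final String DEFAULT_PREFIX = "urlstatuses-";

    static Path resolve(String outputDir, String prefix, String crawlerId, long timestamp) {
        // Fall back to the default prefix when none is supplied (assumption).
        String effectivePrefix = (prefix == null) ? DEFAULT_PREFIX : prefix;
        // Plain concatenation: no separator is added after a custom prefix (assumption).
        String fileName = effectivePrefix + crawlerId + "-" + timestamp + ".tsv";
        return Paths.get(outputDir).resolve(fileName);
    }

    public static void main(String[] args) {
        System.out.println(
                resolve("/report/path", null, "myCrawler", System.currentTimeMillis()));
    }
}
```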
To capture the referring pages, you must use a link extractor that
extracts referrer information. The default link extractor,
GenericLinkExtractor, properly extracts this information, as does
TikaLinkExtractor. This is only a consideration when
using a custom link extractor.
<listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
    <statusCodes>(CSV list of status codes)</statusCodes>
    <outputDir>(path to a directory of your choice)</outputDir>
    <fileNamePrefix>(report file name prefix)</fileNamePrefix>
</listener>
The following example generates a broken links report by recording 404 status codes (from the HTTP response).
<listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
    <statusCodes>404</statusCodes>
    <outputDir>/report/path/</outputDir>
    <fileNamePrefix>brokenLinks</fileNamePrefix>
</listener>
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_FILENAME_PREFIX |
Constructor and Description |
---|
URLStatusCrawlerEventListener() |
Modifier and Type | Method and Description |
---|---|
void | crawlerEvent(ICrawler crawler, CrawlerEvent event) |
boolean | equals(Object other) |
String | getFileNamePrefix() - Gets the generated report file name prefix. |
String | getOutputDir() - Gets the local directory where this listener report will be written. |
String | getStatusCodes() - Gets the status codes to listen for. |
int | hashCode() |
void | loadFromXML(Reader in) |
void | saveToXML(Writer out) |
void | setFileNamePrefix(String fileNamePrefix) - Sets the generated report file name prefix. |
void | setOutputDir(String outputDir) - Sets the local directory where this listener report will be written. |
void | setStatusCodes(String statusCodes) - Sets a comma-separated list of status codes to listen to. |
String | toString() |
public static final String DEFAULT_FILENAME_PREFIX
public String getStatusCodes()
null (listens for all status codes).
public void setStatusCodes(String statusCodes)
statusCodes - the status codes to listen for
public String getOutputDir()
public void setOutputDir(String outputDir)
outputDir - directory path
public String getFileNamePrefix()
public void setFileNamePrefix(String fileNamePrefix)
fileNamePrefix - file name prefix
public void crawlerEvent(ICrawler crawler, CrawlerEvent event)
crawlerEvent in interface ICrawlerEventListener
public void loadFromXML(Reader in) throws IOException
loadFromXML in interface IXMLConfigurable
IOException
public void saveToXML(Writer out) throws IOException
saveToXML in interface IXMLConfigurable
IOException
Copyright © 2009–2021 Norconex Inc. All rights reserved.