Norconex Web Crawler

Flow Diagram

Do you sometimes wonder what your crawler is doing? Knowing more about the sequence of events taking place can help you better configure your crawling solution. The following flowcharts detail how each URL encountered is processed. While they do not cover all available features, they should give you a better idea of what's going on.

The first flowchart shows how URLs are "prepared" before being queued for processing by the next available thread. The second flowchart shows what happens when a thread takes the next URL from the queue and "processes" it.
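To make the "prepare, then queue" idea concrete, here is a minimal Python sketch of a preparation stage. It is not Norconex code or its API: the names `prepare`, `normalize_url`, `reference_filter` and `robots_allows` are hypothetical, the order of the checks is assumed from the flowchart labels, sitemap resolution is left out, and the duplicate check is an added assumption that the flowchart does not show.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase the scheme and host and drop the fragment (one simple normalization)."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def prepare(url, depth, queue, seen, max_depth=2,
            reference_filter=lambda u: True, robots_allows=lambda u: True):
    """Run the preparation checks; return True if the URL was queued, False if rejected."""
    if depth > max_depth:            # "Within max depth?"
        return False
    if not reference_filter(url):    # "Accepted by reference filters?"
        return False
    if not robots_allows(url):       # "Accepted by robots.txt?"
        return False
    normalized = normalize_url(url)  # "Normalize URL"
    if normalized in seen:           # assumption: skip URLs that were already queued
        return False
    seen.add(normalized)
    queue.append((normalized, depth))  # "Queue"
    return True


# A worker thread would later pop (url, depth) pairs from this queue.
queue = deque()
seen = set()
prepare("HTTP://Example.COM/a#frag", 0, queue, seen)
```

In this sketch a rejected URL is simply dropped. A real crawler would typically also record why it was rejected.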


[Flowchart 1, URL preparation. Shapes: Start; Within max depth?; Resolve sitemap.xml; Accepted by reference filters?; Accepted by robots.txt?; Normalize URL; Queue; Rejected.]

[Flowchart 2, URL processing. Shapes: Start; Queue; Delay (politeness); Recrawl?; Fetch HTTP headers?; Successful download?; Accepted by metadata filters?; Canonical?; New meta checksum?; Fetch document; Save document; Extract metadata robot rules; Accepted by metadata robot rules?; Extract links; Pre-process document; Accepted by document filters?; Import document; Accepted by Importer?; New doc checksum?; Post-process document; Commit (add); Delete?; Commit (delete); Rejected.]
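The processing decisions named in the flowchart ("Successful download?", "Accepted by document filters?", "New doc checksum?", "Commit (add)", "Commit (delete)") can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Norconex implements them: `process`, `fetch`, `checksums` and `accepts` are hypothetical names, SHA-256 is an arbitrary choice of checksum, and a failed download is assumed to trigger a delete only when the URL was committed on an earlier run.

```python
import hashlib


def checksum(data):
    """Return a hex digest of the document bytes (SHA-256 chosen for illustration)."""
    return hashlib.sha256(data).hexdigest()


def process(url, fetch, checksums, accepts=lambda body: True):
    """Process one queued URL and return 'commit-add', 'commit-delete' or 'rejected'.

    fetch(url) returns the document bytes, or None when the download fails.
    checksums maps each previously committed URL to its last document checksum.
    """
    body = fetch(url)
    if body is None:                  # "Successful download?" -> no
        if url in checksums:          # assumption: delete only known documents
            del checksums[url]
            return "commit-delete"
        return "rejected"
    if not accepts(body):             # "Accepted by document filters?" -> no
        return "rejected"
    new_sum = checksum(body)
    if checksums.get(url) == new_sum:  # "New doc checksum?" -> no, unchanged
        return "rejected"
    checksums[url] = new_sum
    return "commit-add"
```

Comparing checksums is what allows a recrawl to skip documents that have not changed since the previous run, so only new or modified content is committed again.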