Norconex Web Crawler

Flow Diagram

Knowing the sequence of events taking place can help you better configure your crawling solution. The following flow details how each URL encountered is processed. While it does not cover every available feature, it will give you a better idea of what is going on under the hood.

Queueing phase (starting from the start URLs, with sitemap.xml resolution when applicable, and repeated for each link discovered later):

  • Within max depth? If not: REJECTED_TOO_DEEP.
  • Accepted by reference filters? If not: REJECTED_FILTER.
  • Accepted by robots.txt? If not: REJECTED_ROBOTS_TXT.
  • Normalize the URL and store it in the queue.

Processing phase (for each reference pulled from the queue):

  • Apply the politeness delay.
  • Due for recrawl? If not: REJECTED_PREMATURE.
  • If HTTP HEAD is enabled, fetch the headers (HTTP HEAD). A failed fetch leads to REJECTED_NOTFOUND, REJECTED_BAD_STATUS, or REJECTED_UNMODIFIED. A successful fetch goes through the metadata filters (REJECTED_FILTER), the canonical check (REJECTED_NONCANONICAL), the metadata checksum (REJECTED_UNMODIFIED), and the metadata uniqueness check (REJECTED_DUPLICATE).
  • Fetch the document (HTTP GET). A failed fetch leads to REJECTED_NOTFOUND, REJECTED_BAD_STATUS, or REJECTED_UNMODIFIED.
  • Extract and apply the document robot rules (REJECTED_ROBOTS_META_NOINDEX) and perform the canonical check (REJECTED_NONCANONICAL).
  • If HTTP HEAD was not enabled, apply the metadata filters, metadata checksum, and metadata uniqueness checks at this point (REJECTED_FILTER, REJECTED_UNMODIFIED, REJECTED_DUPLICATE).
  • Save the document and extract its links; each extracted link goes back to the queueing phase.
  • Accepted by document filters? If not: REJECTED_FILTER.
  • Pre-import process the document, then import it. If it is not accepted by the Importer: REJECTED_IMPORT (or REJECTED_ERROR on failure).
  • Post-import process the document and perform the post-import link extraction.
  • New document checksum? If not: REJECTED_UNMODIFIED. Document unique? If not: REJECTED_DUPLICATE.
  • Commit the document (upsert). End.
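
If you prefer reading the flow as code, here is a minimal sketch of the same decision cascade. All class, method, and field names below are hypothetical placeholders, not the Norconex API; each check simply mirrors one decision point of the flow above, and the stubs exist only so the sketch compiles.

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class UrlFlowSketch {

        /** Terminal outcomes, named after the rejection events in the flow. */
        enum Outcome {
            QUEUED, UPSERTED,
            REJECTED_TOO_DEEP, REJECTED_FILTER, REJECTED_ROBOTS_TXT,
            REJECTED_PREMATURE, REJECTED_NOTFOUND, REJECTED_BAD_STATUS,
            REJECTED_NONCANONICAL, REJECTED_UNMODIFIED, REJECTED_DUPLICATE,
            REJECTED_ROBOTS_META_NOINDEX, REJECTED_IMPORT, REJECTED_ERROR
        }

        // Queueing phase: runs for start URLs and for every extracted link.
        Outcome queueReference(String url, int depth) {
            if (depth > maxDepth)              return Outcome.REJECTED_TOO_DEEP;
            if (!referenceFiltersAccept(url))  return Outcome.REJECTED_FILTER;
            if (!robotsTxtAccepts(url))        return Outcome.REJECTED_ROBOTS_TXT;
            queue.add(normalize(url));         // store in queue
            return Outcome.QUEUED;
        }

        // Processing phase: runs for each reference pulled from the queue.
        Outcome processReference(String url) {
            applyPolitenessDelay();
            if (!dueForRecrawl(url))           return Outcome.REJECTED_PREMATURE;

            if (httpHeadEnabled) {
                // Optional HTTP HEAD pass over the headers only.
                // A failed fetch may also map to REJECTED_NOTFOUND or REJECTED_UNMODIFIED.
                if (!fetchHeaders(url))          return Outcome.REJECTED_BAD_STATUS;
                if (!metadataFiltersAccept(url)) return Outcome.REJECTED_FILTER;
                if (!isCanonical(url))           return Outcome.REJECTED_NONCANONICAL;
                if (!metadataChecksumIsNew(url)) return Outcome.REJECTED_UNMODIFIED;
                if (!metadataIsUnique(url))      return Outcome.REJECTED_DUPLICATE;
            }

            // HTTP GET; a failed fetch may also map to REJECTED_BAD_STATUS or REJECTED_UNMODIFIED.
            if (!fetchDocument(url))             return Outcome.REJECTED_NOTFOUND;
            if (!documentRobotRulesAccept(url))  return Outcome.REJECTED_ROBOTS_META_NOINDEX;
            if (!isCanonical(url))               return Outcome.REJECTED_NONCANONICAL;

            if (!httpHeadEnabled) {
                // Same metadata checks, applied to the GET response headers.
                if (!metadataFiltersAccept(url)) return Outcome.REJECTED_FILTER;
                if (!metadataChecksumIsNew(url)) return Outcome.REJECTED_UNMODIFIED;
                if (!metadataIsUnique(url))      return Outcome.REJECTED_DUPLICATE;
            }

            saveDocument(url);
            extractLinks(url);                 // each link re-enters queueReference()
            if (!documentFiltersAccept(url))   return Outcome.REJECTED_FILTER;

            preImportProcess(url);
            if (!importDocument(url))          return Outcome.REJECTED_IMPORT;
            postImportProcess(url);
            postImportExtractLinks(url);

            if (!documentChecksumIsNew(url))   return Outcome.REJECTED_UNMODIFIED;
            if (!documentIsUnique(url))        return Outcome.REJECTED_DUPLICATE;

            commitUpsert(url);                 // handed to the Committer
            return Outcome.UPSERTED;
        }

        // Hypothetical collaborators and settings, stubbed so the sketch compiles.
        private final Deque<String> queue = new ArrayDeque<>();
        private int maxDepth = 10;
        private boolean httpHeadEnabled = false;

        private String normalize(String url)                 { return url; }
        private void applyPolitenessDelay()                  { /* pause between requests */ }
        private boolean dueForRecrawl(String url)            { return true; }
        private boolean fetchHeaders(String url)             { return true; }
        private boolean fetchDocument(String url)            { return true; }
        private boolean referenceFiltersAccept(String url)   { return true; }
        private boolean robotsTxtAccepts(String url)         { return true; }
        private boolean metadataFiltersAccept(String url)    { return true; }
        private boolean documentFiltersAccept(String url)    { return true; }
        private boolean documentRobotRulesAccept(String url) { return true; }
        private boolean isCanonical(String url)              { return true; }
        private boolean metadataChecksumIsNew(String url)    { return true; }
        private boolean documentChecksumIsNew(String url)    { return true; }
        private boolean metadataIsUnique(String url)         { return true; }
        private boolean documentIsUnique(String url)         { return true; }
        private void saveDocument(String url)                { }
        private void extractLinks(String url)                { }
        private void preImportProcess(String url)            { }
        private boolean importDocument(String url)           { return true; }
        private void postImportProcess(String url)           { }
        private void postImportExtractLinks(String url)      { }
        private void commitUpsert(String url)                { }
    }

Note how the metadata checks are guarded by httpHeadEnabled: they run on the HEAD response when HTTP HEAD support is enabled, and on the GET response headers otherwise, which is exactly the branching shown in the flow.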

What about deletions?

The diagram covers "upserts" only. Your Committer can also receive deletion requests. The conditions triggering deletions are many and are greatly influenced by configuration options. A few examples that may apply (see the sketch after this list):

  • "Orphan" pages detected.
  • "Not Found" pages detected.
  • Pages generating errors.
  • Selected crawler events.
  • New filtering rules (e.g., robots.txt, custom, etc.).
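
For illustration, the following is a minimal sketch of a committer handling both kinds of requests. The types and method names are hypothetical placeholders, not the actual Norconex Committer API; the point is only that the target repository must support removals as well as additions and updates.

    import java.util.Map;

    public class CommitterSketch {

        // Hypothetical request types; not the Norconex Committer API.
        interface CommitRequest { String reference(); }
        record Upsert(String reference, String content,
                      Map<String, String> metadata) implements CommitRequest { }
        record Delete(String reference,
                      Map<String, String> metadata) implements CommitRequest { }

        /** Sends each request to the target repository (search engine, database, etc.). */
        void commit(CommitRequest request) {
            if (request instanceof Upsert up) {
                // Add or update the document in the target repository.
                System.out.println("Upserting " + up.reference());
            } else if (request instanceof Delete del) {
                // Remove the document, e.g. because it became an orphan, now returns
                // "Not Found", generates errors, or is excluded by new filtering rules.
                System.out.println("Deleting " + del.reference());
            }
        }

        public static void main(String[] args) {
            CommitterSketch committer = new CommitterSketch();
            committer.commit(new Upsert(
                    "https://example.com/page.html", "<html>...</html>", Map.of()));
            committer.commit(new Delete(
                    "https://example.com/gone.html", Map.of()));
        }
    }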