Norconex File System Crawler

Open-Source Enterprise Crawler (AKA Norconex Filesystem Collector)


v3 stack update: File System Crawler now follows the synchronized 3.x stack line and uses the v3 coordinates.

Crawl File Systems

Use Norconex's flexible open-source enterprise file system crawler to collect, parse, and manipulate data from sources ranging from local hard drives to network locations, and to store it in data repositories such as search engines.


Features

Norconex File System Crawler shares common features with other Norconex Crawlers. The following is a non-exhaustive list of the features it supports:

  • Multi-threaded.
  • Extracts text from many file formats (HTML, PDF, Word, etc.).
  • Extracts metadata associated with documents.
  • Language detection.
  • OCR support on images and PDFs.
  • Translation support.
  • Dynamic title generation.
  • Allows easy text transformation and metadata manipulation.
  • Filters unwanted documents.
  • Detects modified and deleted documents.
  • Easy to add your own features.
  • Can be used from the command line or embedded in your Java application.
  • Can treat embedded documents as distinct documents.
  • Can split a formatted document into multiple documents.
  • Can store crawled URLs in different database engines.
  • Fires crawler event types for custom listeners.
  • Date parsers/formatters to match your source/target repository dates.
  • Can create hierarchical fields.
  • Supports scripting languages for manipulating documents.
  • References XML/HTML elements using simple DOM tree navigation.
  • Extracts document ACLs from SMB/CIFS file systems.
  • Many others.
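
To illustrate what "detects modified and deleted documents" can mean in practice, here is a minimal Java sketch that compares two crawl snapshots by checksum. It is not the crawler's actual implementation or API: the CrawlDiff class, its Result type, and the reference-to-checksum maps are hypothetical names made up for this example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of any Norconex API.
public class CrawlDiff {

    /** Holds the references that were added, modified, or deleted between two runs. */
    public static final class Result {
        public final Set<String> added = new TreeSet<>();
        public final Set<String> modified = new TreeSet<>();
        public final Set<String> deleted = new TreeSet<>();
    }

    /**
     * Compares two snapshots, each mapping a document reference (for example
     * a file path) to a checksum of its content.
     */
    public static Result diff(Map<String, String> previous, Map<String, String> current) {
        Result result = new Result();

        // Anything in the current run is either new, changed, or unchanged.
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String oldChecksum = previous.get(entry.getKey());
            if (oldChecksum == null) {
                result.added.add(entry.getKey());
            } else if (!oldChecksum.equals(entry.getValue())) {
                result.modified.add(entry.getKey());
            }
        }

        // Anything seen previously but missing now is treated as deleted.
        for (String reference : previous.keySet()) {
            if (!current.containsKey(reference)) {
                result.deleted.add(reference);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> previous = Map.of(
                "/docs/a.pdf", "c1", "/docs/b.docx", "c2", "/docs/c.txt", "c3");
        Map<String, String> current = Map.of(
                "/docs/a.pdf", "c1", "/docs/b.docx", "c2-changed", "/docs/d.html", "c4");

        Result r = diff(previous, current);
        System.out.println("added=" + r.added
                + " modified=" + r.modified
                + " deleted=" + r.deleted);
    }
}
```

A real crawler would persist the previous snapshot between runs and might compare file metadata before content; this sketch keeps everything in memory to show only the comparison step.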

Supported Filesystems

  • Local files
  • BZIP2, GZIP
  • FTP, FTPS
  • HDFS
  • HTTP, HTTPS
  • RAM
  • RES
  • SFTP
  • Temp
  • WebDAV
  • Zip, Jar and Tar
  • SMB/CIFS

Latest news

Norconex Google Cloud Search Committer in v3 Stack
2026-07-01
Google Cloud Search Committer now has a Norconex-managed 3.x release line in the v3 stack. The legacy 2.x line developed and hosted by Google remains available.

Norconex Filesystem Collector joins v3 Stack
2026-07-01
File System Crawler now has a Norconex-managed 3.x release line in the synchronized v3 stack, while the legacy 2.x line remains available.

Norconex v3 Stack Coordinates Update
2026-07-01
The v3 train now uses synchronized 3.2.0-SNAPSHOT versions with the new com.norconex.collectors.v3 groupId and Maven path strategy.

Norconex Web Crawler 3.1.0 Released
2025-05-24
Additional options for the WebDriver fetcher, bug fixes, and other changes.