Norconex Importer
Supported Document Formats
While you can write your own text extraction parsers to handle different document formats, chances are the Norconex Importer already supports your file formats.
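To give a rough idea of what writing your own text extraction parser involves, here is a minimal, self-contained sketch. The `TextExtractor` interface, the `KeyValueTextExtractor` class, and the "key: value" record format are all hypothetical and exist only for illustration. They are not part of the Norconex Importer API, so refer to the Importer documentation for the actual parser interface to implement.

```java
// Hypothetical interface for illustration only; NOT the Norconex Importer API.
interface TextExtractor {
    /** Returns the plain text found in the raw document content. */
    String extract(String rawContent);
}

// Extracts the values from a made-up "key: value" record format,
// skipping blank lines and comment lines that start with '#'.
class KeyValueTextExtractor implements TextExtractor {
    @Override
    public String extract(String rawContent) {
        StringBuilder text = new StringBuilder();
        // "\\R" matches any line break sequence (\n, \r\n, ...).
        for (String rawLine : rawContent.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            // Keep only what follows the first ':'; keep the whole line if there is none.
            int sep = line.indexOf(':');
            String value = sep >= 0 ? line.substring(sep + 1).trim() : line;
            if (value.isEmpty()) {
                continue;
            }
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(value);
        }
        return text.toString();
    }
}
```

Calling `new KeyValueTextExtractor().extract(content)` on such a record returns only the values, one per line. A real parser would typically also deal with character encodings, metadata, and embedded documents, which this sketch leaves out.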
The following is a partial list of document formats supported by Norconex Importer. Most are handled by Apache Tika. You can find a detailed breakdown of those by content type here.
If you find more supported formats or inaccuracies, please let us know.
- Plain Text
- HTML
- XML
- XFDL (PureEdge)
- Feeds
  - RSS
  - ATOM
  - IPTC ANPA News Wire Feeds
- Source Code
- Others (may take them as is)
- Microsoft Office
  - Excel
  - Word
  - PowerPoint
  - Visio
  - Outlook
  - Publisher
  - Works
  - Access
  - Files serialized using MFC API
  - OneNote
- Open Office (Open Document Format)
- iWork
- WordPerfect
- QuattroPro
- PDF (including XFA dynamic forms)
- Electronic Publication Format (EPUB)
- Rich Text Format (RTF)
- Web Video Text Tracks Format (WebVTT)
- Geographic ISO 19139 files
- Concise Binary Object Representation (CBOR)
- Compressed/packaged files
  - ar
  - cpio
  - Unix dump
  - tar
  - zip
  - gzip
  - XZ
  - Pack200
  - bzip2
  - 7z
  - arj
  - lzma
  - snappy
  - Z
  - jar
- Audio (mostly metadata only)
- Image formats (metadata only)
  - bitmap (including EXIF)
  - gif
  - jpeg (including JPX/JPEG2000)
  - png
  - tiff
  - psd
  - xmp
  - bpg
  - JBIG2
  - DJVU
- Video (metadata only)
- Java classes
- Fonts
  - True Type Fonts (metadata only)
  - AFM font files
- Email (various formats)
- Microsoft WinHelp (CHM)
- AutoCAD (metadata only)
- Scientific formats
- Public-key cryptography standards (PKCS) files
- Portable Executable (PE)
  - cpl
  - exe
  - dll
  - ocx
  - sys
  - scr
  - drv
  - efi
- Executable and Linkable Format (ELF)
  - o
  - so
  - elf
  - prx
  - puff
  - bin
  - Other ELF files regardless of extension
- ISO files
- EndNote
- Windows Media Metafile
- iCal and vCalendar
- Stata DTA
- DBF
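
The "Other ELF files regardless of extension" entry above hints that a format can be recognized from a file's content rather than its name. One common way to do this is to compare the first bytes of a file against well-known "magic" signatures. The sketch below illustrates that idea only. The `MagicSniffer` class and its labels are hypothetical, it covers just a handful of signatures, and it is not how Norconex Importer or Apache Tika is implemented.

```java
// Hypothetical helper for illustration only; not part of Norconex Importer or Apache Tika.
class MagicSniffer {
    // Well-known leading byte signatures ("magic numbers").
    private static final byte[] ELF = {0x7F, 'E', 'L', 'F'};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    private static final byte[] GZIP = {0x1F, (byte) 0x8B};
    private static final byte[] PDF = {'%', 'P', 'D', 'F'};
    // PE files start with a DOS "MZ" header; this check alone does not prove a file is PE.
    private static final byte[] MZ = {'M', 'Z'};

    /** Returns a coarse format label based on the first bytes of a file. */
    static String sniff(byte[] head) {
        if (startsWith(head, ELF)) return "ELF";
        if (startsWith(head, ZIP)) return "ZIP";
        if (startsWith(head, GZIP)) return "GZIP";
        if (startsWith(head, PDF)) return "PDF";
        if (startsWith(head, MZ)) return "MZ (DOS/PE executable)";
        return "unknown";
    }

    // True when data begins with every byte of prefix, in order.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Passing the first few bytes of a file to `MagicSniffer.sniff` returns the same label whatever the file is named, which is why an ELF binary with an unusual extension can still be identified. A production detector would combine many more signatures with file-name and container inspection.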