PDFPageSplitter (Norconex Importer 2.11.0 API)

java.lang.Object
- com.norconex.importer.handler.AbstractImporterHandler
  - com.norconex.importer.handler.splitter.AbstractDocumentSplitter
    - com.norconex.importer.handler.splitter.impl.PDFPageSplitter

All Implemented Interfaces: IXMLConfigurable, IImporterHandler, IDocumentSplitter

public class PDFPageSplitter
extends AbstractDocumentSplitter
implements IXMLConfigurable

Splits PDFs so that each page is treated as an individual document. May not work on all PDFs (e.g., encrypted PDFs).

The original PDF is kept intact. To eliminate it and keep only the split pages, make sure to filter it out. You can do so by excluding PDFs that lack one of the two fields added to each page: document.pdf.pageNumber or document.pdf.numberOfPages. A filtering example:

 <filter class="com.norconex.importer.handler.filter.impl.EmptyMetadataFilter"
         onMatch="exclude" fields="document.pdf.pageNumber" />
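The effect of such a filter can be sketched in plain Java. This is only an illustration, not the Norconex implementation: the class `SplitPageFilterSketch`, its methods, and the use of a `Map<String, List<String>>` as document metadata are hypothetical, and it assumes that "empty" means the field is missing or holds only blank values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the filter's effect; not part of Norconex Importer.
final class SplitPageFilterSketch {

    // Field name taken from the class description above.
    static final String PAGE_NUMBER_FIELD = "document.pdf.pageNumber";

    // Assumption: a field counts as empty when it is missing
    // or when all of its values are null or blank.
    static boolean isFieldEmpty(Map<String, List<String>> metadata, String field) {
        List<String> values = metadata.get(field);
        if (values == null) {
            return true;
        }
        for (String value : values) {
            if (value != null && !value.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // Keeps only documents that carry a page number, i.e. the split pages.
    // The original PDF has no page number field and is therefore dropped.
    static List<Map<String, List<String>>> keepSplitPages(
            List<Map<String, List<String>>> documents) {
        List<Map<String, List<String>>> kept = new ArrayList<>();
        for (Map<String, List<String>> metadata : documents) {
            if (!isFieldEmpty(metadata, PAGE_NUMBER_FIELD)) {
                kept.add(metadata);
            }
        }
        return kept;
    }
}
```

Passing the metadata of the original PDF together with the metadata of its split pages to `keepSplitPages` would return only the pages, which mirrors `onMatch="exclude"` on an empty `document.pdf.pageNumber` field.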

By default this splitter restricts its use to document.contentType matching application/pdf.

Should be used as a pre-parse handler.

XML configuration usage:

  <splitter class="com.norconex.importer.handler.splitter.impl.PDFPageSplitter">

      <restrictTo caseSensitive="[false|true]"
              field="(name of header/metadata field to match)">
          (Regular expression of value to match. Default restricts on
           "document.contentType" being "application/pdf".)
      </restrictTo>
      <!-- multiple "restrictTo" tags allowed (only one needs to match) -->

      <referencePagePrefix>
          (String to insert before the page number when it is appended to
          the document reference. Default is "#".)
      </referencePagePrefix>

  </splitter>
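As a rough illustration of how a `restrictTo` entry could be evaluated, the following sketch matches a field value against a regular expression with optional case sensitivity. The class `RestrictToSketch` and its methods are hypothetical, and the sketch assumes the expression must match the whole field value; the actual matching rules are defined by Norconex Importer and may differ.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of "restrictTo" semantics; not the Norconex implementation.
final class RestrictToSketch {

    // Assumption: the regular expression must match the entire field value.
    static boolean matches(String fieldValue, String regex, boolean caseSensitive) {
        if (fieldValue == null) {
            return false;
        }
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        return Pattern.compile(regex, flags).matcher(fieldValue).matches();
    }

    // Several restrictions may be configured; only one needs to match.
    static boolean anyMatches(String fieldValue, List<String> regexes, boolean caseSensitive) {
        for (String regex : regexes) {
            if (matches(fieldValue, regex, caseSensitive)) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, the default restriction corresponds to calling `matches` with the value of `document.contentType` and the expression `application/pdf`.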

Usage example:

The following example splits PDFs and appends the page number to the original PDF reference as "#page1", "#page2", etc.

  <splitter class="com.norconex.importer.handler.splitter.impl.PDFPageSplitter">
      <referencePagePrefix>#page</referencePagePrefix>
  </splitter>
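The resulting references can be sketched as simple string concatenation of the original reference, the prefix, and a page number starting at 1. The class `PageReferenceSketch`, its method, and the sample reference used in the note below are hypothetical; this only illustrates the naming scheme described above, not how the splitter builds references internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the page reference naming scheme.
final class PageReferenceSketch {

    // Builds one reference per page: original reference + prefix + page number.
    // Pages are numbered from 1, as in "#page1", "#page2".
    static List<String> pageReferences(String originalReference, String prefix, int pageCount) {
        List<String> references = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            references.add(originalReference + prefix + page);
        }
        return references;
    }
}
```

For a two-page PDF with the hypothetical reference `http://example.com/doc.pdf` and the prefix `#page`, this yields `http://example.com/doc.pdf#page1` and `http://example.com/doc.pdf#page2`. With the default prefix `#`, the references would end in `#1` and `#2`.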

Since: 2.9.0
Author: Pascal Essiembre

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`DEFAULT_REFERENCE_PAGE_PREFIX`
`static String`	`DOC_PDF_PAGE_NO`
`static String`	`DOC_PDF_TOTAL_PAGES`

Constructor Summary

Constructors
Constructor and Description
`PDFPageSplitter()`

Method Summary

Modifier and Type	Method and Description
`boolean`	`equals(Object other)`
`String`	`getReferencePagePrefix()`
`int`	`hashCode()`
`protected void`	`loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml)` Loads configuration settings specific to the implementing class.
`protected void`	`saveHandlerToXML(EnhancedXMLStreamWriter writer)` Saves configuration settings specific to the implementing class.
`void`	`setReferencePagePrefix(String referencePagePrefix)`
`protected List<ImporterDocument>`	`splitApplicableDocument(SplittableDocument doc, OutputStream output, CachedStreamFactory streamFactory, boolean parsed)`
`String`	`toString()`

Methods inherited from class com.norconex.importer.handler.splitter.AbstractDocumentSplitter
splitDocument

Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Methods inherited from interface com.norconex.commons.lang.config.IXMLConfigurable
loadFromXML, saveToXML

- Field Detail
  - DOC_PDF_PAGE_NO
```
public static final String DOC_PDF_PAGE_NO
```
    See Also:
    
    Constant Field Values
  - DOC_PDF_TOTAL_PAGES
```
public static final String DOC_PDF_TOTAL_PAGES
```
    See Also:
    
    Constant Field Values
  - DEFAULT_REFERENCE_PAGE_PREFIX
```
public static final String DEFAULT_REFERENCE_PAGE_PREFIX
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - PDFPageSplitter
```
public PDFPageSplitter()
```
- Method Detail
  - getReferencePagePrefix
```
public String getReferencePagePrefix()
```
  - setReferencePagePrefix
```
public void setReferencePagePrefix(String referencePagePrefix)
```
  - splitApplicableDocument
```
protected List<ImporterDocument> splitApplicableDocument(SplittableDocument doc,
                                                         OutputStream output,
                                                         CachedStreamFactory streamFactory,
                                                         boolean parsed)
                                                  throws ImporterHandlerException
```
    Specified by:
    
    splitApplicableDocument in class AbstractDocumentSplitter
    
    Throws:
    
    ImporterHandlerException
  - loadHandlerFromXML
```
protected void loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml)
                           throws IOException
```
    Description copied from class: AbstractImporterHandler
    
    Loads configuration settings specific to the implementing class.
    
    Specified by:
    
    loadHandlerFromXML in class AbstractImporterHandler
    
    Parameters:
    
    xml - xml configuration
    
    Throws:
    
    IOException - could not load from XML
  - saveHandlerToXML
```
protected void saveHandlerToXML(EnhancedXMLStreamWriter writer)
                         throws XMLStreamException
```
    Description copied from class: AbstractImporterHandler
    
    Saves configuration settings specific to the implementing class. The parent tag along with the "class" attribute are already written. Implementors must not close the writer.
    
    Specified by:
    
    saveHandlerToXML in class AbstractImporterHandler
    
    Parameters:
    
    writer - the xml writer
    
    Throws:
    
    XMLStreamException - could not save to XML
  - equals
```
public boolean equals(Object other)
```
    Overrides:
    
    equals in class AbstractImporterHandler
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class AbstractImporterHandler
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class AbstractImporterHandler
