PDFPageSplitter (Norconex Importer 3.0.1 API)

java.lang.Object
- com.norconex.importer.handler.AbstractImporterHandler
  - com.norconex.importer.handler.splitter.AbstractDocumentSplitter
    - com.norconex.importer.handler.splitter.impl.PDFPageSplitter

All Implemented Interfaces: IXMLConfigurable, IImporterHandler, IDocumentSplitter

public class PDFPageSplitter
extends AbstractDocumentSplitter
implements IXMLConfigurable

Splits PDFs so that each page is treated as an individual document. May not work on all PDFs (e.g., encrypted PDFs).

The original PDF is kept intact. To keep only the split pages, filter out the original by excluding PDFs that lack one of the two fields added to each page: document.pdf.pageNumber or document.pdf.numberOfPages. A filtering example in XML:


<filter
    class="com.norconex.importer.handler.filter.impl.EmptyFilter"
    onMatch="exclude">
  <fieldMatcher
      matchWhole="true">
    document.pdf.pageNumber
  </fieldMatcher>
</filter>
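
The effect of the filter above can be sketched in plain Java. This is only a simplified model, not the Importer API: the class name `SplitPageFilterSketch`, the method `isExcluded`, and the representation of metadata as a `Map<String, List<String>>` are assumptions made for illustration. It treats a document as excluded when document.pdf.pageNumber is missing or blank, which is the case for the original, unsplit PDF.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the exclusion rule; not part of Norconex Importer.
final class SplitPageFilterSketch {

    // Field the splitter adds to each split page (from the class description).
    static final String PAGE_NUMBER_FIELD = "document.pdf.pageNumber";

    // Returns true when the document has no usable page number,
    // i.e. it is the original PDF rather than one of its split pages.
    static boolean isExcluded(Map<String, List<String>> metadata) {
        List<String> values = metadata.get(PAGE_NUMBER_FIELD);
        if (values == null || values.isEmpty()) {
            return true;
        }
        // Assumption: a field holding only blank values counts as empty.
        return values.stream().allMatch(v -> v == null || v.trim().isEmpty());
    }

    public static void main(String[] args) {
        Map<String, List<String>> original = new HashMap<>();
        original.put("document.contentType", Arrays.asList("application/pdf"));

        Map<String, List<String>> page = new HashMap<>();
        page.put(PAGE_NUMBER_FIELD, Arrays.asList("1"));

        System.out.println(isExcluded(original)); // original PDF: excluded
        System.out.println(isExcluded(page));     // split page: kept
    }
}
```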

By default, this splitter applies only to documents whose document.contentType matches application/pdf.

Should be used as a pre-parse handler.

XML configuration usage:


<handler
    class="com.norconex.importer.handler.splitter.impl.PDFPageSplitter">
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <restrictTo>
    <fieldMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (field-matching expression)
    </fieldMatcher>
    <valueMatcher
        method="[basic|csv|wildcard|regex]"
        ignoreCase="[false|true]"
        ignoreDiacritic="[false|true]"
        partial="[false|true]">
      (value-matching expression)
    </valueMatcher>
  </restrictTo>
  <referencePagePrefix>
    (String to insert before the page number when it is appended to the
    document reference. Default is "#".)
  </referencePagePrefix>
</handler>

XML usage example:


<handler
    class="PDFPageSplitter">
  <referencePagePrefix>#page</referencePagePrefix>
</handler>

The above example splits PDFs and appends the page number to the original PDF reference as "#page1", "#page2", etc.
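
The way split-page references are composed can be sketched as follows. This is a minimal illustration, not the splitter's actual implementation: the class `PageReferenceSketch`, its methods, and the sample URL are hypothetical. It assumes only what the example above states, namely that the prefix and a page number starting at 1 are appended to the original reference.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating reference composition; not part of Norconex Importer.
final class PageReferenceSketch {

    // Original reference + configured prefix + 1-based page number.
    static String pageReference(String originalRef, String prefix, int pageNumber) {
        return originalRef + prefix + pageNumber;
    }

    // References for every page of a document with the given page count.
    static List<String> pageReferences(String originalRef, String prefix, int pageCount) {
        List<String> refs = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            refs.add(pageReference(originalRef, prefix, page));
        }
        return refs;
    }

    public static void main(String[] args) {
        // With the "#page" prefix from the XML example above.
        for (String ref : pageReferences("http://example.com/report.pdf", "#page", 3)) {
            System.out.println(ref);
        }
    }
}
```

With the default prefix "#", the same helper would yield references ending in "#1", "#2", and so on.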

Since: 2.9.0
Author: Pascal Essiembre

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`DEFAULT_REFERENCE_PAGE_PREFIX`
`static String`	`DOC_PDF_PAGE_NO`
`static String`	`DOC_PDF_TOTAL_PAGES`

Constructor Summary

Constructors
Constructor and Description
`PDFPageSplitter()`

Method Summary

Methods
Modifier and Type	Method and Description
`boolean`	`equals(Object other)`
`String`	`getReferencePagePrefix()`
`int`	`hashCode()`
`protected void`	`loadHandlerFromXML(XML xml)` Loads configuration settings specific to the implementing class.
`protected void`	`saveHandlerToXML(XML xml)` Saves configuration settings specific to the implementing class.
`void`	`setReferencePagePrefix(String referencePagePrefix)`
`protected List<Doc>`	`splitApplicableDocument(HandlerDoc doc, InputStream input, OutputStream output, ParseState parseState)`
`String`	`toString()`

Methods inherited from class com.norconex.importer.handler.splitter.AbstractDocumentSplitter
splitDocument

Methods inherited from class com.norconex.importer.handler.AbstractImporterHandler
addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Methods inherited from interface com.norconex.commons.lang.xml.IXMLConfigurable
loadFromXML, saveToXML

- Field Detail
  - DOC_PDF_PAGE_NO
```
public static final String DOC_PDF_PAGE_NO
```
    See Also: Constant Field Values
  - DOC_PDF_TOTAL_PAGES
```
public static final String DOC_PDF_TOTAL_PAGES
```
    See Also: Constant Field Values
  - DEFAULT_REFERENCE_PAGE_PREFIX
```
public static final String DEFAULT_REFERENCE_PAGE_PREFIX
```
    See Also: Constant Field Values
- Constructor Detail
  - PDFPageSplitter
```
public PDFPageSplitter()
```
- Method Detail
  - getReferencePagePrefix
```
public String getReferencePagePrefix()
```
  - setReferencePagePrefix
```
public void setReferencePagePrefix(String referencePagePrefix)
```
  - splitApplicableDocument
```
protected List<Doc> splitApplicableDocument(HandlerDoc doc,
                                            InputStream input,
                                            OutputStream output,
                                            ParseState parseState)
                                     throws ImporterHandlerException
```
    Specified by: splitApplicableDocument in class AbstractDocumentSplitter
    Throws: ImporterHandlerException
  - loadHandlerFromXML
```
protected void loadHandlerFromXML(XML xml)
```
    Description copied from class: AbstractImporterHandler
    Loads configuration settings specific to the implementing class.
    Specified by: loadHandlerFromXML in class AbstractImporterHandler
    Parameters: xml - XML configuration
  - saveHandlerToXML
```
protected void saveHandlerToXML(XML xml)
```
    Description copied from class: AbstractImporterHandler
    Saves configuration settings specific to the implementing class.
    Specified by: saveHandlerToXML in class AbstractImporterHandler
    Parameters: xml - the XML
  - equals
```
public boolean equals(Object other)
```
    Overrides: equals in class AbstractImporterHandler
  - hashCode
```
public int hashCode()
```
    Overrides: hashCode in class AbstractImporterHandler
  - toString
```
public String toString()
```
    Overrides: toString in class AbstractImporterHandler
