public class PDFPageSplitter extends AbstractDocumentSplitter implements IXMLConfigurable
Splits PDF pages so that each page is treated as an individual document. May not work on all PDFs (e.g., encrypted PDFs).
The original PDF is kept intact. If you want to eliminate it and keep only
the split pages, make sure to filter it out. You can do so by excluding
PDFs that lack one of these two fields, which are added to each split page:
document.pdf.pageNumber or document.pdf.numberOfPages. A filtering example:
<filter class="com.norconex.importer.handler.filter.impl.EmptyMetadataFilter" onMatch="exclude" fields="document.pdf.pageNumber" />
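To illustrate what the filter above accomplishes, here is a minimal, self-contained Java sketch. It does not use the Norconex API: the class `SplitPageFilterSketch` and its method `keepOnlySplitPages` are hypothetical names, and documents are modelled as plain metadata maps rather than importer documents. It keeps only entries that carry a non-blank document.pdf.pageNumber value, so the original (unsplit) PDF is dropped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: documents are modelled as metadata maps,
// not as Norconex importer documents.
class SplitPageFilterSketch {

    // Field name taken from the class description above.
    static final String PAGE_NUMBER_FIELD = "document.pdf.pageNumber";

    // Keeps documents whose page-number field is present and non-blank,
    // mirroring an "exclude when the field is empty" rule.
    static List<Map<String, String>> keepOnlySplitPages(
            List<Map<String, String>> docs) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> metadata : docs) {
            String page = metadata.get(PAGE_NUMBER_FIELD);
            if (page != null && !page.trim().isEmpty()) {
                kept.add(metadata);
            }
        }
        return kept;
    }

    // Small helper to build a metadata map for the demo below.
    static Map<String, String> doc(String reference, String pageNumber) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("reference", reference);
        if (pageNumber != null) {
            metadata.put(PAGE_NUMBER_FIELD, pageNumber);
        }
        return metadata;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("report.pdf", null));   // original PDF, no page number
        docs.add(doc("report.pdf#1", "1"));  // split page
        docs.add(doc("report.pdf#2", "2"));  // split page
        for (Map<String, String> kept : keepOnlySplitPages(docs)) {
            System.out.println(kept.get("reference"));
        }
    }
}
```

Running `main` prints the two split-page references and omits the original `report.pdf` entry.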
By default this splitter restricts its use to document.contentType matching application/pdf.
Should be used as a pre-parse handler.
<splitter class="com.norconex.importer.handler.splitter.impl.PDFPageSplitter">
  <restrictTo caseSensitive="[false|true]"
      field="(name of header/metadata field name to match)">
    (Regular expression of value to match. Default restricts on "document.contentType" being "application/pdf".)
  </restrictTo>
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <referencePagePrefix>
    (String to put before the page number is appended to the document reference. Default is "#".)
  </referencePagePrefix>
</splitter>
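The "only one needs to match" rule for multiple restrictTo tags can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the library's implementation: `RestrictToSketch`, `Restriction`, and `anyRestrictionMatches` are hypothetical names, metadata is modelled as a simple map, and the regular expression is assumed to have to match the whole field value.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of "restrictTo" semantics; not the Norconex API.
class RestrictToSketch {

    // One restriction: a metadata field plus a regular expression.
    static final class Restriction {
        final String field;
        final Pattern pattern;

        Restriction(String field, String regex, boolean caseSensitive) {
            this.field = field;
            this.pattern = Pattern.compile(
                    regex, caseSensitive ? 0 : Pattern.CASE_INSENSITIVE);
        }

        // Assumption: the expression must match the entire field value.
        boolean matches(Map<String, String> metadata) {
            String value = metadata.get(field);
            return value != null && pattern.matcher(value).matches();
        }
    }

    // True as soon as any single restriction matches.
    static boolean anyRestrictionMatches(
            List<Restriction> restrictions, Map<String, String> metadata) {
        for (Restriction restriction : restrictions) {
            if (restriction.matches(metadata)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Equivalent of the documented default restriction.
        List<Restriction> restrictions = Arrays.asList(
                new Restriction("document.contentType", "application/pdf", false));

        Map<String, String> pdf = new HashMap<>();
        pdf.put("document.contentType", "application/pdf");
        Map<String, String> html = new HashMap<>();
        html.put("document.contentType", "text/html");

        System.out.println(anyRestrictionMatches(restrictions, pdf));  // true
        System.out.println(anyRestrictionMatches(restrictions, html)); // false
    }
}
```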
The following example will split PDFs and will append the page number to the original PDF reference as "#page1", "#page2", etc.
<splitter class="com.norconex.importer.handler.splitter.impl.PDFPageSplitter">
  <referencePagePrefix>#page</referencePagePrefix>
</splitter>
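The resulting references can be pictured with the following self-contained Java sketch. It only illustrates the naming scheme described above (original reference, then the prefix, then the page number starting at 1); `PageReferenceSketch` and `pageReferences` are hypothetical names and no PDF is actually parsed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of how split-page references are composed.
class PageReferenceSketch {

    // Builds one reference per page: original + prefix + 1-based page number.
    static List<String> pageReferences(
            String originalReference, String prefix, int numberOfPages) {
        List<String> references = new ArrayList<>();
        for (int page = 1; page <= numberOfPages; page++) {
            references.add(originalReference + prefix + page);
        }
        return references;
    }

    public static void main(String[] args) {
        // With the "#page" prefix from the example configuration.
        for (String reference : pageReferences(
                "http://example.com/report.pdf", "#page", 3)) {
            System.out.println(reference);
        }
    }
}
```

For a three-page PDF this prints `http://example.com/report.pdf#page1` through `http://example.com/report.pdf#page3`; with the default prefix "#" the references would end in `#1`, `#2`, `#3`.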
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_REFERENCE_PAGE_PREFIX |
static String | DOC_PDF_PAGE_NO |
static String | DOC_PDF_TOTAL_PAGES |
Constructor and Description |
---|
PDFPageSplitter() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
String | getReferencePagePrefix() |
int | hashCode() |
protected void | loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml) Loads configuration settings specific to the implementing class. |
protected void | saveHandlerToXML(EnhancedXMLStreamWriter writer) Saves configuration settings specific to the implementing class. |
void | setReferencePagePrefix(String referencePagePrefix) |
protected List<ImporterDocument> | splitApplicableDocument(SplittableDocument doc, OutputStream output, CachedStreamFactory streamFactory, boolean parsed) |
String | toString() |
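Methods inherited from class AbstractDocumentSplitter: splitDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Methods inherited from class Object: clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface IXMLConfigurable: loadFromXML, saveToXML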
public static final String DOC_PDF_PAGE_NO
public static final String DOC_PDF_TOTAL_PAGES
public static final String DEFAULT_REFERENCE_PAGE_PREFIX
public String getReferencePagePrefix()
public void setReferencePagePrefix(String referencePagePrefix)
protected List<ImporterDocument> splitApplicableDocument(SplittableDocument doc, OutputStream output, CachedStreamFactory streamFactory, boolean parsed) throws ImporterHandlerException
Specified by: splitApplicableDocument in class AbstractDocumentSplitter
Throws: ImporterHandlerException
protected void loadHandlerFromXML(org.apache.commons.configuration.XMLConfiguration xml) throws IOException
Description copied from class: AbstractImporterHandler
Loads configuration settings specific to the implementing class.
Specified by: loadHandlerFromXML in class AbstractImporterHandler
Parameters: xml - xml configuration
Throws: IOException - could not load from XML
protected void saveHandlerToXML(EnhancedXMLStreamWriter writer) throws XMLStreamException
Description copied from class: AbstractImporterHandler
Saves configuration settings specific to the implementing class.
Specified by: saveHandlerToXML in class AbstractImporterHandler
Parameters: writer - the xml writer
Throws: XMLStreamException - could not save to XML
public boolean equals(Object other)
Overrides: equals in class AbstractImporterHandler
public int hashCode()
Overrides: hashCode in class AbstractImporterHandler
public String toString()
Overrides: toString in class AbstractImporterHandler
Copyright © 2009–2021 Norconex Inc. All rights reserved.