public class PDFPageSplitter extends AbstractDocumentSplitter implements IXMLConfigurable
Splits PDF pages so that each page is treated as an individual document. May not work on all PDFs (e.g., encrypted PDFs).
The original PDF is kept intact. If you want to eliminate it and keep only
the split pages, make sure to filter it out. You can do so by excluding
PDFs that lack one of these two fields, which are added to each page:
document.pdf.pageNumber or document.pdf.numberOfPages. A filtering example:
<filter
class="com.norconex.importer.handler.filter.impl.EmptyFilter"
onMatch="exclude">
<fieldMatcher
matchWhole="true">
document.pdf.pageNumber
</fieldMatcher>
</filter>
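The filtering rule above can be sketched in plain Java. This is only an illustration and does not use the Importer API: the SplitPageFilterSketch class, its keep method, and the Map-based metadata are hypothetical stand-ins, and only the field name document.pdf.pageNumber comes from this page. It assumes that a missing or blank page-number field identifies the original, unsplit PDF.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of the Importer API.
class SplitPageFilterSketch {

    // Field name taken from the documentation above.
    static final String PAGE_NO_FIELD = "document.pdf.pageNumber";

    // Returns true when the document should be kept, i.e. when it carries
    // a non-blank page number. Documents without it are treated as the
    // original PDF and excluded (assumption made for this sketch).
    static boolean keep(Map<String, List<String>> metadata) {
        List<String> values = metadata.get(PAGE_NO_FIELD);
        if (values == null) {
            return false;
        }
        for (String value : values) {
            if (value != null && !value.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Original PDF: no page-number field.
        Map<String, List<String>> original = new HashMap<>();
        original.put("document.contentType",
                Collections.singletonList("application/pdf"));

        // Split page: page-number field present.
        Map<String, List<String>> page = new HashMap<>();
        page.put(PAGE_NO_FIELD, Arrays.asList("2"));

        System.out.println(keep(original)); // prints "false"
        System.out.println(keep(page));     // prints "true"
    }
}
```

With this rule, only the documents produced by the splitter pass through, which matches the intent of the onMatch="exclude" filter shown above.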
By default this splitter restricts its use to document.contentType matching application/pdf.
Should be used as a pre-parse handler.
<handler
class="com.norconex.importer.handler.splitter.impl.PDFPageSplitter">
<!-- multiple "restrictTo" tags allowed (only one needs to match) -->
<restrictTo>
<fieldMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]">
(field-matching expression)
</fieldMatcher>
<valueMatcher
method="[basic|csv|wildcard|regex]"
ignoreCase="[false|true]"
ignoreDiacritic="[false|true]"
partial="[false|true]">
(value-matching expression)
</valueMatcher>
</restrictTo>
<referencePagePrefix>
(String to put before the page number is appended to the document
reference. Default is "#".)
</referencePagePrefix>
</handler>
<handler
class="PDFPageSplitter">
<referencePagePrefix>#page</referencePagePrefix>
</handler>
The above example will split PDFs and will append the page number to the original PDF reference as "#page1", "#page2", etc.
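As a rough illustration of the resulting references, the following self-contained sketch builds page references by concatenating the original reference, the prefix, and the page number. The PageReferenceSketch class, its pageReferences method, and the example URL are hypothetical and are not part of PDFPageSplitter. Numbering from 1 is an assumption based on the "#page1", "#page2" example above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the splitter's actual implementation.
class PageReferenceSketch {

    // Builds one reference per page: originalRef + prefix + pageNumber.
    // Page numbers are assumed to start at 1.
    static List<String> pageReferences(
            String originalRef, String prefix, int pageCount) {
        List<String> refs = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            refs.add(originalRef + prefix + page);
        }
        return refs;
    }

    public static void main(String[] args) {
        // Example URL is made up for illustration.
        List<String> refs = pageReferences(
                "http://example.com/report.pdf", "#page", 3);
        for (String ref : refs) {
            System.out.println(ref);
        }
    }
}
```

With the default prefix "#", the same call would produce references ending in "#1", "#2", and "#3".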
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_REFERENCE_PAGE_PREFIX |
static String | DOC_PDF_PAGE_NO |
static String | DOC_PDF_TOTAL_PAGES |
Constructor and Description |
---|
PDFPageSplitter() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
String | getReferencePagePrefix() |
int | hashCode() |
protected void | loadHandlerFromXML(XML xml) Loads configuration settings specific to the implementing class. |
protected void | saveHandlerToXML(XML xml) Saves configuration settings specific to the implementing class. |
void | setReferencePagePrefix(String referencePagePrefix) |
protected List<Doc> | splitApplicableDocument(HandlerDoc doc, InputStream input, OutputStream output, ParseState parseState) |
String | toString() |
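Methods inherited from class AbstractDocumentSplitter: splitDocument
Methods inherited from class AbstractImporterHandler: addRestriction, addRestriction, addRestrictions, clearRestrictions, detectCharsetIfBlank, getRestrictions, isApplicable, loadFromXML, removeRestriction, removeRestriction, saveToXML
Methods inherited from class Object: clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Methods inherited from interface IXMLConfigurable: loadFromXML, saveToXML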
public static final String DOC_PDF_PAGE_NO
public static final String DOC_PDF_TOTAL_PAGES
public static final String DEFAULT_REFERENCE_PAGE_PREFIX
public String getReferencePagePrefix()
public void setReferencePagePrefix(String referencePagePrefix)
protected List<Doc> splitApplicableDocument(HandlerDoc doc, InputStream input, OutputStream output, ParseState parseState) throws ImporterHandlerException
Specified by: splitApplicableDocument in class AbstractDocumentSplitter
Throws: ImporterHandlerException
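protected void loadHandlerFromXML(XML xml)
Description copied from class: AbstractImporterHandler
Loads configuration settings specific to the implementing class.
Specified by: loadHandlerFromXML in class AbstractImporterHandler
Parameters: xml - XML configuration
protected void saveHandlerToXML(XML xml)
Description copied from class: AbstractImporterHandler
Saves configuration settings specific to the implementing class.
Specified by: saveHandlerToXML in class AbstractImporterHandler
Parameters: xml - the XML
public boolean equals(Object other)
Overrides: equals in class AbstractImporterHandler
public int hashCode()
Overrides: hashCode in class AbstractImporterHandler
public String toString()
Overrides: toString in class AbstractImporterHandler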
Copyright © 2009–2023 Norconex Inc. All rights reserved.