这是我目前面临的挑战
我有很多PDF,我必须删除其中的空白页面,只显示包含内容(文本或图像)的页面。
问题是那些pdf是扫描文件
因此,空白页面的扫描仪会留下一些脏污。
答案 0 :(得分:4)
我做了一些研究并最终得到了这个代码,它将99%的页面检查为白色或浅灰色。 我需要灰色因子,因为扫描的文档有时不是纯白色。
private static Boolean isBlank(PDPage pdfPage) throws IOException {
BufferedImage bufferedImage = pdfPage.convertToImage();
long count = 0;
int height = bufferedImage.getHeight();
int width = bufferedImage.getWidth();
Double areaFactor = (width * height) * 0.99;
for (int x = 0; x < width ; x++) {
for (int y = 0; y < height ; y++) {
Color c = new Color(bufferedImage.getRGB(x, y));
// verify light gray and white
if (c.getRed() == c.getGreen() && c.getRed() == c.getBlue()
&& c.getRed() >= 248) {
count++;
}
}
}
if (count >= areaFactor) {
return true;
}
return false;
}
答案 1 :(得分:0)
http://www.rgagnon.com/javadetails/java-detect-and-remove-blank-page-in-pdf.html
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
public class RemoveBlankPageFromPDF {
// value where we can consider that this is a blank image
// can be much higher or lower depending of what is considered as a blank page
public static final int BLANK_THRESHOLD = 160;
public static void removeBlankPdfPages(String source, String destination)
throws IOException, DocumentException
{
PdfReader r = null;
RandomAccessSourceFactory rasf = null;
RandomAccessFileOrArray raf = null;
Document document = null;
PdfCopy writer = null;
try {
r = new PdfReader(source);
// deprecated
// RandomAccessFileOrArray raf
// = new RandomAccessFileOrArray(pdfSourceFile);
// itext 5.4.1
rasf = new RandomAccessSourceFactory();
raf = new RandomAccessFileOrArray(rasf.createBestSource(source));
document = new Document(r.getPageSizeWithRotation(1));
writer = new PdfCopy(document, new FileOutputStream(destination));
document.open();
PdfImportedPage page = null;
for (int i=1; i<=r.getNumberOfPages(); i++) {
// first check, examine the resource dictionary for /Font or
// /XObject keys. If either are present -> not blank.
PdfDictionary pageDict = r.getPageN(i);
PdfDictionary resDict = (PdfDictionary) pageDict.get( PdfName.RESOURCES );
boolean noFontsOrImages = true;
if (resDict != null) {
noFontsOrImages = resDict.get( PdfName.FONT ) == null &&
resDict.get( PdfName.XOBJECT ) == null;
}
System.out.println(i + " noFontsOrImages " + noFontsOrImages);
if (!noFontsOrImages) {
byte bContent [] = r.getPageContent(i,raf);
ByteArrayOutputStream bs = new ByteArrayOutputStream();
bs.write(bContent);
System.out.println
(i + bs.size() + " > BLANK_THRESHOLD " + (bs.size() > BLANK_THRESHOLD));
if (bs.size() > BLANK_THRESHOLD) {
page = writer.getImportedPage(r, i);
writer.addPage(page);
}
}
}
}
finally {
if (document != null) document.close();
if (writer != null) writer.close();
if (raf != null) raf.close();
if (r != null) r.close();
}
}
public static void main (String ... args) throws Exception {
removeBlankPdfPages
("C://temp//documentwithblank.pdf", "C://temp//documentwithnoblank.pdf");
}
}
答案 2 :(得分:0)
@Shoyo的代码适用于 PDFBox版本<2.0 。对于将来的读者来说,没有太大的变化,但是,以防万一,这里是 PDFBOX 2.0 + 的代码,使您的生活更轻松。
在您的main
中(主要是指您将PDF加载到PDDocument中的位置)方法:
try {
PDDocument document = PDDocument.load(new File("/home/codemantra/Downloads/tetml_ct_access/C.pdf"));
PDFRenderer renderedDoc = new PDFRenderer(document);
for (int pageNumber = 0; pageNumber < document.getNumberOfPages(); pageNumber++) {
if(isBlank(renderedDoc.renderImage(pageNumber))) {
System.out.println("Blank Page Number : " + pageNumber + 1);
}
}
} catch (Exception e) {
e.printStackTrace();
}
isBlank
方法将只传入BufferedImage
:
private static Boolean isBlank(BufferedImage pageImage) throws IOException {
BufferedImage bufferedImage = pageImage;
long count = 0;
int height = bufferedImage.getHeight();
int width = bufferedImage.getWidth();
Double areaFactor = (width * height) * 0.99;
for (int x = 0; x < width; x++) {
for (int y = 0; y < height; y++) {
Color c = new Color(bufferedImage.getRGB(x, y));
if (c.getRed() == c.getGreen() && c.getRed() == c.getBlue() && c.getRed() >= 248) {
count++;
}
}
}
if (count >= areaFactor) {
return true;
}
return false;
}
所有功劳归@Shoyo
更新:
某些PDF的“此页被故意留为空白” ,以上代码被视为空白。如果这是您的要求,请随时使用上面的代码。但是,我的要求只是过滤掉完全空白的页面(不存在任何图像,也不包含任何字体)。因此,我最终使用了这段代码(加上这段代码运行得更快:P):
public static void main(String[] args) {
try {
PDDocument document = PDDocument.load(new File("/home/codemantra/Downloads/CTP2040.pdf"));
PDPageTree allPages = document.getPages();
Integer pageNumber = 1;
for (PDPage page : allPages) {
Iterable<COSName> xObjects = page.getResources().getXObjectNames();
Iterable<COSName> fonts = page.getResources().getFontNames();
if(xObjects.spliterator().getExactSizeIfKnown() == 0 && fonts.spliterator().getExactSizeIfKnown() == 0) {
System.out.println(pageNumber);
}
pageNumber++;
}
} catch (Exception e) {
e.printStackTrace();
}
}
这将返回完全空白的页面的页码。
希望这对某人有帮助! :)