Question

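I tried to convert PDF pages to images, but my code only extracted the individual images embedded in each page, not an image of the page itself.

Here is the code: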

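import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.List;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

public class ExtractionPDFtoThumbImgs {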

    static String filePath = "/Users/tmdtjq/Downloads/PDFTest/test.pdf";
    static String outputFilePath = "/Users/tmdtjq/Downloads/PDFTest/pageimages";

    public static void change(File inputFile, File outputFolder) throws IOException {
        // TODO: check that the input file exists and is a PDF
        // TODO: handle encrypted PDFs
        PDDocument doc = null;
        try {
            doc = PDDocument.load(inputFile);
            // PDFBox 1.8.x API: getAllPages() and convertToImage() were removed in PDFBox 2.x
            List<PDPage> allPages = doc.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = allPages.get(i);
                // Render the page once and write it out as <page number>.jpg
                BufferedImage image = page.convertToImage();
                ImageIO.write(image, "jpg", new File(outputFolder.getAbsolutePath() + File.separator + (i + 1) + ".jpg"));
            }
        } finally {
            if (doc != null) {
                doc.close();
            }
        }
    }

    public static void main(String[] args) {
        File inputFile = new File(ExtractionPDFtoThumbImgs.filePath);
        File outputFolder = new File(ExtractionPDFtoThumbImgs.outputFilePath);
        if(!outputFolder.exists()){
            outputFolder.mkdirs();
        }
        try {
            ExtractionPDFtoThumbImgs.change(inputFile, outputFolder);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

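The code above extracts the images embedded in the PDF pages. It does not convert each page (including its text) into an image.

Is there a conversion tool (PDF page to image), or a PDFBox class that does this conversion?

Please suggest how to get an image of a whole PDF page (including text), rather than the images embedded in the page.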

before converting

after converting

Answer 1

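Try pdftocairo, which is part of poppler.

I was using ImageMagick to convert PDFs to images, but it relies on Ghostscript, which is sometimes picky about the PDFs you give it, so it was hit or miss...

So far, pdftocairo has been robust.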

http://poppler.freedesktop.org
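
Since the question is written in Java, here is a minimal sketch of calling pdftocairo from Java with ProcessBuilder. It assumes pdftocairo is installed and on the PATH. The class name PdfToCairoSketch, the helper names, the 150 DPI value, and the "page" output prefix are illustrative choices, not something from the question. The -png and -r options select PNG output and the resolution; pdftocairo appends the page number to the output prefix.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PdfToCairoSketch {

    // Builds the argument list: pdftocairo -png -r <dpi> <input.pdf> <outputPrefix>
    public static List<String> buildCommand(File pdf, File outputPrefix, int dpi) {
        List<String> cmd = new ArrayList<>();
        cmd.add("pdftocairo");   // assumed to be on the PATH
        cmd.add("-png");         // write one PNG per page
        cmd.add("-r");           // resolution in DPI
        cmd.add(Integer.toString(dpi));
        cmd.add(pdf.getPath());
        cmd.add(outputPrefix.getPath());
        return cmd;
    }

    // Runs pdftocairo and returns its exit code (0 means success).
    public static int renderPages(File pdf, File outputPrefix, int dpi)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, outputPrefix, dpi))
                .inheritIO()     // forward pdftocairo's messages to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Paths reused from the question; adjust them for your machine.
        File pdf = new File("/Users/tmdtjq/Downloads/PDFTest/test.pdf");
        File prefix = new File("/Users/tmdtjq/Downloads/PDFTest/pageimages/page");
        int exitCode = renderPages(pdf, prefix, 150);
        System.out.println("pdftocairo exited with code " + exitCode);
    }
}
```

The output folder has to exist before the call. Passing the arguments as a list rather than a single command string avoids quoting problems when a path contains spaces.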
