Question

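When extracting text from a PDF with PDFBox, I get spaces between the letters. This seems to be caused by the Courier New font, because extraction works fine with other fonts. The application runs on AWS Lambda. In the logs I also see, only for this particular PDF, the error "Could not write to font cache java.io.FileNotFoundException: /home/user/.pdfbox.cache".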

I tried replacing every font in the PDDocument with Arial.

PDFont font = PDTrueTypeFont.loadTTF(_PDdoc, new File("C:\\Windows\\FONTS\\arial.ttf"));
for (int i = 0; i < _PDdoc.getNumberOfPages(); ++i) {
    PDPage page1 = _PDdoc.getPage(i);
    PDResources res = page1.getResources();
    // Overwrite every font resource on the page with Arial
    for (COSName fontName : res.getFontNames()) {
        res.put(fontName, font);
    }
}

But this does not work as expected. On my local machine there is no cache problem. Any leads would be appreciated.
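For the cache error specifically, one thing that may be worth ruling out: on AWS Lambda the home directory is generally not writable, while the temp directory is. The sketch below assumes a PDFBox 2.x release that reads the `pdfbox.fontcache` system property to locate its font cache; the class and method names (`FontCacheConfig`, `useWritableFontCache`) are hypothetical. Whether this also affects the spacing in the extracted text is not something I have verified.

```java
public class FontCacheConfig {

    /**
     * Points the PDFBox font cache at the JVM temp directory.
     * Assumption: the PDFBox version in use honours the "pdfbox.fontcache"
     * system property. Call this before any PDFBox class is loaded.
     */
    public static String useWritableFontCache() {
        String tmpDir = System.getProperty("java.io.tmpdir");
        System.setProperty("pdfbox.fontcache", tmpDir);
        return tmpDir;
    }

    public static void main(String[] args) {
        System.out.println("pdfbox.fontcache = " + useWritableFontCache());
    }
}
```

In a Lambda handler this would typically go in a static initializer, so it runs once per container before the first document is loaded.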

I also tried to implement the solution given in "Apache PDFBox Remove Spaces between characters".

String extractNoSpaces(PDDocument document, String regionName, PDPage page) throws IOException
{
    PDFTextStripperByArea pts = new PDFTextStripperByArea() {
        @Override
        protected void processTextPosition(TextPosition text)
        {
            int[] character = text.getCharacterCodes();
            //check for space
        }
    };
    pts = _PDFTextStripperByAreaMap.get(regionName);
    pts.setSortByPosition(true);
    pts.extractRegions(page);
    return pts.getTextForRegion(regionName);
}

The documentation does not say much about getCharacterCodes(), and the overridden method above is never executed.
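As a fallback that does not depend on the stripper at all, the extracted string could be cleaned up afterwards. This is a different technique from the linked answer: a plain regex heuristic over the output text, sketched with hypothetical names (`SpacedTextCleaner`, `collapseSpacedLetters`). It assumes the stray spaces are single spaces between single characters, and it would merge adjacent words if the real word gap is also a single space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpacedTextCleaner {

    // A run of at least three single non-space characters,
    // each separated from the next by exactly one space.
    private static final Pattern SPACED_RUN =
            Pattern.compile("(?<!\\S)(?:\\S ){2,}\\S(?!\\S)");

    /** Removes the single spaces inside every letter-spaced run. */
    public static String collapseSpacedLetters(String text) {
        Matcher m = SPACED_RUN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String joined = m.group().replace(" ", "");
            m.appendReplacement(out, Matcher.quoteReplacement(joined));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "Courier New"
        System.out.println(collapseSpacedLetters("C o u r i e r New"));
    }
}
```

Short genuine sequences such as "I a m" would be collapsed as well, so this only makes sense if applied to text known to come from the affected font.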

Getting spaces between letters when extracting text from a PDF

0 answers: