pdfbox: CPU at 100% when extracting text concurrently

Asked: 2018-04-03 13:24:54

Tags: java pdfbox

I am using pdfbox 2.0.1 to parse PDF documents like this:

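        // Imports this fragment needs:
        // import java.io.ByteArrayInputStream;
        // import java.io.IOException;
        // import java.io.InputStream;
        // import org.apache.pdfbox.pdmodel.PDDocument;
        // import org.apache.pdfbox.text.PDFTextStripper;

        // fileContent holds the PDF bytes (byte[]) and is defined elsewhere.
        for (int i = 0; i < 5; i++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    // try-with-resources closes the document first, then the stream
                    try (InputStream in = new ByteArrayInputStream(fileContent);
                         PDDocument document = PDDocument.load(in)) {
                        PDFTextStripper stripper = new PDFTextStripper();
                        String content = stripper.getText(document).trim();
                        System.out.println(content);
                    } catch (IOException e) {
                        // load, getText and close all declare IOException,
                        // which run() cannot propagate
                        e.printStackTrace();
                    }
                }
            }).start();
        }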

Sometimes, when PDFs are parsed concurrently, the CPU runs at 100%. The stack trace of the busy thread is:

java.lang.Thread.State: RUNNABLE
at java.util.HashMap.get(HashMap.java:303)
at org.apache.pdfbox.pdmodel.font.encoding.GlyphList.toUnicode(GlyphList.java:231)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:308)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:273)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:668)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:609)
at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:52)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:136)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)

The relevant code in GlyphList.java is:

// Adobe Glyph List (AGL)
private static final GlyphList DEFAULT = load("glyphlist.txt", 4281);


 /**
     * Returns the Unicode character sequence for the given glyph name, or null if there isn't any.
     *
     * @param name PostScript glyph name
     * @return Unicode character(s), or null.
     */
public String toUnicode(String name)
{
    if (name == null)
    {
        return null;
    }

    String unicode = nameToUnicode.get(name);
    if (unicode != null)
    {
        return unicode;
    }

    // separate read/write cache for thread safety
    unicode = uniNameToUnicodeCache.get(name);
    if (unicode == null)
    {
        // test if we have a suffix and if so remove it
        if (name.indexOf('.') > 0)
        {
            unicode = toUnicode(name.substring(0, name.indexOf('.')));
        }
        else if (name.startsWith("uni") && name.length() == 7)
        {
            // test for Unicode name in the format uniXXXX where X is hex
            int nameLength = name.length();
            StringBuilder uniStr = new StringBuilder();
            try
            {
                for (int chPos = 3; chPos + 4 <= nameLength; chPos += 4)
                {
                    int codePoint = Integer.parseInt(name.substring(chPos, chPos + 4), 16);
                    if (codePoint > 0xD7FF && codePoint < 0xE000)
                    {
                        LOG.warn("Unicode character name with disallowed code area: " + name);
                    }
                    else
                    {
                        uniStr.append((char) codePoint);
                    }
                }
                unicode = uniStr.toString();
            }
            catch (NumberFormatException nfe)
            {
                LOG.warn("Not a number in Unicode character name: " + name);
            }
        }
        else if (name.startsWith("u") && name.length() == 5)
        {
            // test for an alternate Unicode name representation uXXXX
            try
            {
                int codePoint = Integer.parseInt(name.substring(1), 16);
                if (codePoint > 0xD7FF && codePoint < 0xE000)
                {
                    LOG.warn("Unicode character name with disallowed code area: " + name);
                }
                else
                {
                    unicode = String.valueOf((char) codePoint);
                }
            }
            catch (NumberFormatException nfe)
            {
                LOG.warn("Not a number in Unicode character name: " + name);
            }
        }
        uniNameToUnicodeCache.put(name, unicode);
    }
    return unicode;
}

So when it is called like this

GlyphList.DEFAULT.toUnicode(code)

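a concurrency error occurs (note the field uniNameToUnicodeCache), and this is exactly the call that PDSimpleFont.toUnicode makes.

However, nobody else seems to have run into the same problem, so I am not sure whether my analysis above is right or wrong. If it really is a bug, has it been fixed?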

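1 Answer:

Answer 0: (score: 2)

Looking at the code of the GlyphList class, it is clear that it has not been prepared for multi-threaded use. On the other hand, its DEFAULT instance is handed out as a singleton by getAdobeGlyphList and is therefore used concurrently by the text extraction code.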

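If the documents in question use glyph names in the unofficial schemes uniXXXX or uXXXX, this can become a problem in the toUnicode(String) method, because in that case the method not only reads from the HashMap uniNameToUnicodeCache but also writes to it (adding the unofficial glyph name it found for faster lookup later).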

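If such a write happens while some other thread is reading from the map, things can indeed go wrong. A plain HashMap is not safe for unsynchronized concurrent modification: racing writes can leave its internal structure corrupted (on older JDKs a concurrent resize can create a cycle in a bucket chain), after which HashMap.get may loop forever. That would match the 100% CPU and the stack trace ending in HashMap.get. A ConcurrentModificationException is not the likely symptom here, since it is thrown by iterators rather than by get.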

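I would propose changing GlyphList to either

  • no longer write to uniNameToUnicodeCache,
  • synchronize toUnicode(String), or more precisely the reads and writes of uniNameToUnicodeCache inside it, or
  • make uniNameToUnicodeCache a ConcurrentHashMap instead of a HashMap.

I would expect the third option to perform better than the second.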
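
To make the third option concrete, here is a minimal sketch. It is not PDFBox code: GlyphNameCacheSketch and its decode helper are hypothetical stand-ins that model only the uniXXXX / uXXXX branch of toUnicode(String). One detail worth noting is that ConcurrentHashMap rejects null values, whereas the quoted method calls put even when no Unicode value was found, so the put has to be guarded.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the caching part of GlyphList; not PDFBox code. */
class GlyphNameCacheSketch {
    // Option 3: ConcurrentHashMap allows concurrent get/put without external locking.
    private final Map<String, String> uniNameToUnicodeCache = new ConcurrentHashMap<>();

    /** Resolves names of the form uniXXXX or uXXXX; returns null for anything else. */
    String toUnicode(String name) {
        if (name == null) {
            return null;
        }
        String unicode = uniNameToUnicodeCache.get(name);
        if (unicode == null) {
            unicode = decode(name);
            // ConcurrentHashMap throws NullPointerException on null values,
            // so only successful lookups are cached.
            if (unicode != null) {
                uniNameToUnicodeCache.put(name, unicode);
            }
        }
        return unicode;
    }

    /** Simplified decoding of the two unofficial naming schemes (assumption: BMP only). */
    private static String decode(String name) {
        String hex;
        if (name.startsWith("uni") && name.length() == 7) {
            hex = name.substring(3);
        } else if (name.startsWith("u") && name.length() == 5) {
            hex = name.substring(1);
        } else {
            return null;
        }
        try {
            int codePoint = Integer.parseInt(hex, 16);
            // Reject negative values and the surrogate range D800..DFFF.
            if (codePoint < 0 || (codePoint > 0xD7FF && codePoint < 0xE000)) {
                return null;
            }
            return String.valueOf((char) codePoint);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        GlyphNameCacheSketch glyphs = new GlyphNameCacheSketch();
        Thread[] threads = new Thread[5];
        for (int i = 0; i < threads.length; i++) {
            // Every thread resolves the same names, so reads and writes overlap.
            threads[i] = new Thread(() -> {
                for (int cp = 0x41; cp < 0x5B; cp++) {
                    glyphs.toUnicode(String.format("uni%04X", cp));
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(glyphs.toUnicode("uni0041") + glyphs.toUnicode("u0042"));
    }
}
```

Two threads may still decode the same name at the same moment and both store it, but they store the same value, so the duplicate work is harmless. The second option would keep the HashMap and instead declare toUnicode as synchronized, at the cost of serializing every lookup.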