Wrong characters appear when replacing text in a PDF with PDFBox

Asked: 2015-10-28 11:46:54

Tags: java pdf pdfbox

I am trying to replace text in a PDF with other text. This is my code:

    PDDocument doc = null;
    int occurrences = 0;
    try {
        doc = PDDocument.load("test.pdf"); //Input PDF File Name
        List pages = doc.getDocumentCatalog().getAllPages();
        for (int i = 0; i < pages.size(); i++) {
            PDPage page = (PDPage) pages.get(i);
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream());
            parser.parse();
            List tokens = parser.getTokens();
            for (int j = 0; j < tokens.size(); j++) {
                Object next = tokens.get(j);
                if (next instanceof PDFOperator) {
                    PDFOperator op = (PDFOperator) next;
                    // Tj and TJ are the two operators that display strings in a PDF
                    if (op.getOperation().equals("Tj")) {
                        // Tj takes one operand, the string to display,
                        // so update that operand
                        COSString previous = (COSString) tokens.get(j - 1);
                        String string = previous.getString();
                        // Word to change: this code replaces "Good" with "Bad".
                        // Only rewrite the operand when there is a match, so
                        // untouched strings keep their original bytes.
                        if (string.contains("Good")) {
                            string = string.replace("Good", "Bad");
                            occurrences++;
                            previous.reset();
                            previous.append(string.getBytes("ISO-8859-1"));
                        }
                    } else if (op.getOperation().equals("TJ")) {
                        COSArray previous = (COSArray) tokens.get(j - 1);
                        COSString temp = new COSString();

                        String tempString = "";
                        for (int t = 0; t < previous.size(); t++) {

                            if (previous.get(t) instanceof COSString) {
                                tempString += ((COSString) previous.get(t)).getString();

                            }
                        }

                        temp.append(tempString.getBytes("ISO-8859-1"));
                        tempString = "";
                        tempString = temp.getString();
                        if (tempString.contains("Good")) {
                            tempString = tempString.replace("Good", "Bad");
                            occurrences++;
                        }
                        previous.clear();

                        String[] stringArray = tempString.split(" ");

                        for (String string : stringArray) {
                            COSString cosString = new COSString();
                            string = string + " ";
                            cosString.append(string.getBytes("ISO-8859-1"));
                            previous.add(cosString);
                        }

                    }
                }
            }
            // now that the tokens are updated we will replace the page content stream.
            PDStream updatedStream = new PDStream(doc);
            OutputStream out = updatedStream.createOutputStream();
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
            tokenWriter.writeTokens(tokens);
            out.close(); // flush the rewritten tokens into the new stream
            page.setContents(updatedStream);
        }
        System.out.println("number of matches found: " + occurrences);
        doc.save("a.pdf"); //Output file name
    } catch (IOException ex) {
        Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
    } catch (COSVisitorException ex) {
        Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
    } finally {
        if (doc != null) {
            try {
                doc.close();
            } catch (IOException ex) {
                Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
    }
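
As a side note on the TJ branch above: it joins the string operands, replaces the word, splits the result on spaces, and appends a space to every piece. The following is only a plain-Java model of that round trip, not PDFBox code. The class `TjRebuildSketch`, the method `rebuild`, and the use of a number to stand in for a kerning adjustment are all hypothetical names and assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the TJ branch; no PDFBox types are used.
class TjRebuildSketch {

    // Joins the String operands, replaces `from` with `to`, then splits on
    // spaces and appends a space to each piece. Non-String operands (standing
    // in for the numeric adjustments of a TJ array) are skipped, just as the
    // instanceof COSString check skips them in the code above.
    static List<String> rebuild(List<Object> tjOperands, String from, String to) {
        StringBuilder joined = new StringBuilder();
        for (Object operand : tjOperands) {
            if (operand instanceof String) {
                joined.append((String) operand);
            }
        }
        String replaced = joined.toString().replace(from, to);
        List<String> rebuilt = new ArrayList<>();
        for (String word : replaced.split(" ")) {
            rebuilt.add(word + " ");
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        List<Object> operands = new ArrayList<>();
        operands.add("Go");
        operands.add(-20);      // assumed kerning adjustment between the two strings
        operands.add("od day");
        System.out.println(rebuild(operands, "Good", "Bad"));
    }
}
```

In this model the input operands "Go", -20, "od day" come back as the two strings "Bad " and "day ": the numeric adjustment is gone and a trailing space has been added, so the rebuilt array is not positioned the same way as the original.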

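The problem is that the text is replaced with wrong characters or invisible shapes (for example, the word "Bad" shows up as only the character "d"). However, if I copy the text and paste it somewhere else, the expected word is pasted correctly. When I search the generated PDF for the new word, it is not found, but when I search for the old word, it is found at the place where the replacement was made.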
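
One possible explanation for the "only d is visible" symptom, stated as an assumption and not confirmed for this particular file: if the font is embedded as a subset that holds glyphs only for the characters already used on the page, then "B" and "a" have no glyph to draw, while "d" (which also occurs in "Good") does. The sketch below is plain Java with no PDFBox calls; `SubsetGlyphCheck` and `missingGlyphs` are hypothetical names, and the set of available characters is approximated by the original text itself.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Rough check of which replacement characters an embedded font subset might lack.
class SubsetGlyphCheck {

    // Returns the characters of `replacement` that never occur in `textUsingFont`.
    // Assumption: the subset font contains glyphs only for characters that
    // already appear in the text drawn with that font.
    static Set<Character> missingGlyphs(String textUsingFont, String replacement) {
        Set<Character> available = new LinkedHashSet<>();
        for (char c : textUsingFont.toCharArray()) {
            available.add(c);
        }
        Set<Character> missing = new LinkedHashSet<>();
        for (char c : replacement.toCharArray()) {
            if (!available.contains(c)) {
                missing.add(c);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // "Good" supplies G, o, d; "Bad" needs B, a, d.
        System.out.println(missingGlyphs("Good", "Bad")); // prints [B, a]
    }
}
```

Under that assumption, a replacement word is only safe to write this way when `missingGlyphs` returns an empty set for the text that uses the same font.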

1 Answer:

Answer 0 (score: -2)

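I found Aspose. This link shows how to use it to replace text in a PDF. It is easy and works flawlessly, except that it is not free: the free version prints a copyright line at the top of the PDF pages. http://www.aspose.com/docs/display/pdfjava/Replace+Text+in+Pages+of+a+PDF+Document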