PDF文本替换不正常

时间:2015-04-02 04:38:54

标签: java pdf pdfbox

我使用此处给出的代码替换.pdf文件中的文本

public void doIt( String inputFile, String outputFile, String strToFind, String message) throws IOException, COSVisitorException
{
    // the document
    PDDocument doc = null;
    try
    {
        doc = PDDocument.load( inputFile );
        List pages = doc.getDocumentCatalog().getAllPages();
        for( int i=0; i<pages.size(); i++ )
        {
            PDPage page = (PDPage)pages.get( i );
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream() );
            parser.parse();
            List tokens = parser.getTokens();
            for( int j=0; j<tokens.size(); j++ )
            {
                Object next = tokens.get( j );
                if( next instanceof PDFOperator )
                {
                    PDFOperator op = (PDFOperator)next;
                    //Tj and TJ are the two operators that display
                    //strings in a PDF
                    if( op.getOperation().equals( "Tj" ) )
                    {
                        //Tj takes one operator and that is the string
                        //to display so lets update that operator
                        COSString previous = (COSString)tokens.get( j-1 );
                        String string = previous.getString();
                        if (string.matches(strToFind)) {
                            string = string.replaceFirst( strToFind, message );
                            previous.reset();
                            previous.append( string.getBytes("ISO-8859-1") );
                        }
                    }
                    else if( op.getOperation().equals( "TJ" ) )
                    {
                        COSArray previous = (COSArray)tokens.get( j-1 );
                        for( int k=0; k<previous.size(); k++ )
                        {
                            Object arrElement = previous.getObject( k );
                            if( arrElement instanceof COSString )
                            {
                                COSString cosString = (COSString)arrElement;
                                String string = cosString.getString();
                                if (string.matches(strToFind)) 
                                {
                                    string = string.replaceFirst( strToFind, message );
                                    cosString.reset();
                                    cosString.append( string.getBytes("ISO-8859-1") );
                                }
                            }
                        }
                    }
                }
            }
            //now that the tokens are updated we will replace the
            //page content stream.
            PDStream updatedStream = new PDStream(doc);
            OutputStream out = updatedStream.createOutputStream();
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
            tokenWriter.writeTokens( tokens );
            page.setContents( updatedStream );
        }
        doc.save( outputFile );
    }
    finally
    {
        if( doc != null )
        {
            doc.close();
        }
    }
}

此代码适用于极少数格式。对于大多数格式,它会像这样读取文本

A
d
obe 
A
c
r
obat PDF Fi
les
Adobe® Portable Do
cument F
o
rm
at (PDF
) is 
a universal 
file 
format 

由于这种奇怪的阅读,替换功能无效。请告诉我如何修改我的代码以使用所有格式。

0 个答案:

没有答案