Problem when trying to remove inline images from a PDF using iTextSharp

Time: 2014-07-13 18:11:30

Tags: pdf itextsharp

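I recently discovered iTextSharp.

I am investigating a performance problem with the rendering of PDF documents, and Bruno Lowagie (the author of iText) explained to me why I am running into it: it is caused by "inline images" in my PDF documents. He also explained the basics of removing those inline images. (My goal is to "possibly" display a preview of the document, with a clear notice that it is not the actual file because the actual file is slow to open. I am well aware that what I am trying to do is far from robust or safe, and that the problem should be solved at another level, for example when the documents are generated.)

Unfortunately, I have not managed to implement the cleanup myself. Here is my current code (inspired by various samples on Stack Overflow):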

PdfReader pdfReader = new PdfReader(filename);
try
{  
    //pdfReader.RemoveUnusedObjects();

    var cleanfilename = filename.Replace(".pdf", ".clean.pdf");
    if (File.Exists(cleanfilename))
        File.Delete(cleanfilename);

    using (var file = new FileStream(cleanfilename, FileMode.Create))
    {
        var pdfstamper = new PdfStamper(pdfReader, file);

        for (var page = 1; page <= pdfReader.NumberOfPages; page++)
        {    
            PdfDictionary pageDict = pdfReader.GetPageN(page);
            PdfObject pageObj = pageDict.GetDirectObject(PdfName.CONTENTS);
            if (pageObj.IsStream())
            {
                CleanStream(pageObj);
            }
            else if (pageObj.IsArray())
            {
                PdfArray pageArray = pageDict.GetAsArray(PdfName.CONTENTS);

                for (int j = 0; j < pageArray.Size; j++)
                {
                    PdfIndirectReference arrayElement = (PdfIndirectReference)pageArray[j];
                    pageObj = pdfReader.GetPdfObject(arrayElement.Number);
                    if (pageObj.IsStream())
                    {
                        CleanStream(pageObj);
                    }
                }
            }
        }

        pdfstamper.Close();
    }
}
catch (Exception ex)
{
    MessageBox.Show("Error: " + ex.Message, "Error");
}
finally
{
    pdfReader.Close();
}

Regex regEx = new Regex("\\nBI.*?\\nEI", RegexOptions.Compiled);

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);

    var currentContent = Encoding.ASCII.GetString(data);    
    var newContent = regEx.Replace(currentContent, "");
    var newData = Encoding.ASCII.GetBytes(newContent);

    stream.SetData(newData);
}

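It works fine on PDFs without inline images, but the text disappears from pages that do contain inline images.

I assumed the problem was in the replacement, but as far as I can tell that is not the case. With the following pass-through code, the output document is fine: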

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);

    stream.SetData(data);
}

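But with the following code, which in theory should not change a single byte (should it?), the output document no longer displays correctly (some content apparently cannot be rendered):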

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);

    var currentContent = Encoding.ASCII.GetString(data);    
    var newData = Encoding.ASCII.GetBytes(currentContent);

    stream.SetData(newData);
}

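It looks as if converting the byte array to a string and back to a byte array is not a "transparent" operation.

I really do not understand this. On the other hand, I know I am a real beginner when it comes to PDF. What am I missing?

This is not critical at all (I do not really mind if I never manage to remove those inline images), but now I really want to know what is going on :D

Here is a sample PDF: https://drive.google.com/file/d/0Byqch0ZyIb5DWDdmSTJ3SDMxMW8/edit?usp=sharing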

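2 answers:

Answer 0 (score: 1)

As you have already found out, and as I pointed out in my comment, it is not a good idea to manipulate a content stream without looking at every operator in the stream. You really need to parse the syntax and interpret every operator and every operand.

Please take a look at the OCG removal functionality in the com.itextpdf.text.pdf.ocg package, which is in the extra jar shipped with iText.

In the OCGParser class, we define all the possible operators: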

protected void populateOperators() {
    if (operators != null)
        return;
    operators = new HashMap<String, PdfOperator>();
    operators.put(DEFAULTOPERATOR, new CopyContentOperator());
    PathConstructionOrPaintingOperator opConstructionPainting = new PathConstructionOrPaintingOperator();
    operators.put("m", opConstructionPainting);
    operators.put("l", opConstructionPainting);
    operators.put("c", opConstructionPainting);
    operators.put("v", opConstructionPainting);
    operators.put("y", opConstructionPainting);
    operators.put("h", opConstructionPainting);
    operators.put("re", opConstructionPainting);
    operators.put("S", opConstructionPainting);
    operators.put("s", opConstructionPainting);
    operators.put("f", opConstructionPainting);
    operators.put("F", opConstructionPainting);
    operators.put("f*", opConstructionPainting);
    operators.put("B", opConstructionPainting);
    operators.put("B*", opConstructionPainting);
    operators.put("b", opConstructionPainting);
    operators.put("b*", opConstructionPainting);
    operators.put("n", opConstructionPainting);
    operators.put("W", opConstructionPainting);
    operators.put("W*", opConstructionPainting);
    GraphicsOperator graphics = new GraphicsOperator();
    operators.put("q", graphics);
    operators.put("Q", graphics);
    operators.put("w", graphics);
    operators.put("J", graphics);
    operators.put("j", graphics);
    operators.put("M", graphics);
    operators.put("d", graphics);
    operators.put("ri", graphics);
    operators.put("i", graphics);
    operators.put("gs", graphics);
    operators.put("cm", graphics);
    operators.put("g", graphics);
    operators.put("G", graphics);
    operators.put("rg", graphics);
    operators.put("RG", graphics);
    operators.put("k", graphics);
    operators.put("K", graphics);
    operators.put("cs", graphics);
    operators.put("CS", graphics);
    operators.put("sc", graphics);
    operators.put("SC", graphics);
    operators.put("scn", graphics);
    operators.put("SCN", graphics);
    operators.put("sh", graphics);
    XObjectOperator xObject = new XObjectOperator();
    operators.put("Do", xObject);
    InlineImageOperator inlineImage = new InlineImageOperator();
    operators.put("BI", inlineImage);
    operators.put("EI", inlineImage);
    TextOperator text = new TextOperator();
    operators.put("BT", text);
    operators.put("ID", text);
    operators.put("ET", text);
    operators.put("Tc", text);
    operators.put("Tw", text);
    operators.put("Tz", text);
    operators.put("TL", text);
    operators.put("Tf", text);
    operators.put("Tr", text);
    operators.put("Ts", text);
    operators.put("Td", text);
    operators.put("TD", text);
    operators.put("Tm", text);
    operators.put("T*", text);
    operators.put("Tj", text);
    operators.put("'", text);
    operators.put("\"", text);
    operators.put("TJ", text);
    MarkedContentOperator markedContent = new MarkedContentOperator();
    operators.put("BMC", markedContent);
    operators.put("BDC", markedContent);
    operators.put("EMC", markedContent);
}

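The parse() method looks at all content streams, including the content streams of Form XObjects (which, if I read your code correctly, you ignore).

In the process() method, we copy every operator with all of its operands, unless some condition tells us that part of the syntax needs to be removed.

You should adapt this code so that all operators are copied except those that involve inline images. Your approach is a brute-force approach that is bound to break more PDFs than it fixes.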
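The "copy everything except inline images" idea can be illustrated with a toy dispatch table. This is only a sketch of the pattern, not iText code: `Handler`, `COPY`, `DROP` and `filter` are hypothetical names, and it assumes a tokenizer has already split the stream into instructions and delivers a whole inline image as a single instruction whose operator is `EI`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OperatorFilterSketch {

    /** Hypothetical handler type; iText's own PdfOperator interface is different. */
    interface Handler {
        void handle(List<String> operands, String operator, StringBuilder out);
    }

    /** Default behaviour: write the operands and the operator back unchanged. */
    static final Handler COPY = (operands, operator, out) -> {
        for (String operand : operands) {
            out.append(operand).append(' ');
        }
        out.append(operator).append('\n');
    };

    /** Handler for instructions that must not reach the output. */
    static final Handler DROP = (operands, operator, out) -> { };

    /** Each instruction is a token list whose last element is the operator. */
    static String filter(List<List<String>> instructions, Map<String, Handler> handlers) {
        StringBuilder out = new StringBuilder();
        for (List<String> tokens : instructions) {
            String operator = tokens.get(tokens.size() - 1);
            List<String> operands = tokens.subList(0, tokens.size() - 1);
            // Operators without a registered handler fall back to COPY.
            handlers.getOrDefault(operator, COPY).handle(operands, operator, out);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Handler> handlers = new HashMap<>();
        // Assumption: the tokenizer hands over a complete inline image
        // as one instruction ending in "EI".
        handlers.put("EI", DROP);

        List<List<String>> page = new ArrayList<>();
        page.add(Arrays.asList("BT"));
        page.add(Arrays.asList("(Hello)", "Tj"));
        page.add(Arrays.asList("ET"));
        page.add(Arrays.asList("<inline image>", "EI"));
        System.out.print(filter(page, handlers));
    }
}
```

The point of the design is that nothing is removed by pattern matching on raw text: an instruction is dropped only because the parser identified its operator, so the characters "BI" or "EI" occurring inside a string or inside image data cannot trigger a removal.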

Answer 1 (score: 0)

Instead of working with strings, I worked directly on the bytes:

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);
    var workingData = new byte[data.Length];

    var BI = Encoding.ASCII.GetBytes("\nBI");
    var EI = Encoding.ASCII.GetBytes("\nEI");

    var len = EI.Length - 1;
    var BIpos = data.Locate(BI);
    var EIpos = data.Locate(EI);
    var pos = BIpos.Length;
    if (pos != EIpos.Length)
        throw new Exception("BI and EI operators not matching ?!");

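    // Copy every byte that lies outside a "\nBI" ... "\nEI" span.
    var skip = 0;   // index of the inline image currently being looked for
    var newI = 0;   // number of bytes written to workingData so far
    for (var i = 0; i < data.Length; i++)
    {
        if (skip >= pos || i < BIpos[skip])
        {
            // Past the last image, or before the next one: keep the byte.
            workingData[newI] = data[i];
            newI++;
        }
        else if (i >= EIpos[skip] + len)
            skip++;   // reached the final 'I' of "EI": move on to the next image
    }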

    var newData = new byte[newI];
    Array.Copy(workingData, newData, newI);

    stream.SetData(newData);
}
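As far as I can tell, the reason the string version damaged the stream is that ASCII is a 7-bit encoding. A content stream can contain bytes above 0x7F (binary inline image data, and quite possibly the text strings as well), and such bytes do not survive a decode/encode cycle through ASCII. The sketch below shows the effect in Java, the language of the OCGParser excerpt; `roundTrip` is a hypothetical helper. My understanding is that `Encoding.ASCII` in .NET substitutes `?` in a similar way, and that an 8-bit encoding such as `Encoding.GetEncoding("ISO-8859-1")` would be the lossless counterpart there.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripSketch {

    /** Decodes the bytes to a String and encodes them again with the same charset. */
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // "BI" followed by two bytes above 0x7F, as found in binary image data.
        byte[] data = { 'B', 'I', (byte) 0x80, (byte) 0xFF };

        byte[] ascii = roundTrip(data, StandardCharsets.US_ASCII);
        byte[] latin1 = roundTrip(data, StandardCharsets.ISO_8859_1);

        // ASCII cannot represent 0x80 or 0xFF, so both come back as '?'.
        System.out.println("ASCII round trip intact:      " + Arrays.equals(data, ascii));
        // ISO-8859-1 maps every byte value to exactly one character.
        System.out.println("ISO-8859-1 round trip intact: " + Arrays.equals(data, latin1));
    }
}
```

Working on the byte array directly, as in the method above, avoids the question of encodings altogether.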

"Locate" is an extension method suggested here: byte[] array pattern search
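The linked method is not reproduced here. Judging only from how it is used above (the result is indexed and its length is read), it appears to return the start position of every match. A naive search with that behaviour could look like the following Java sketch; `locate` is a hypothetical stand-in, not the linked implementation.

```java
import java.util.ArrayList;
import java.util.List;

class LocateSketch {

    /** Returns the start index of every occurrence of pattern in data. */
    static int[] locate(byte[] data, byte[] pattern) {
        if (pattern.length == 0 || pattern.length > data.length) {
            return new int[0];
        }
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i <= data.length - pattern.length; i++) {
            int j = 0;
            // Compare the pattern against the window starting at i.
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                hits.add(i);
            }
        }
        int[] result = new int[hits.size()];
        for (int k = 0; k < result.length; k++) {
            result[k] = hits.get(k);
        }
        return result;
    }
}
```

A plain byte search like this cannot tell a real `EI` operator from the same three bytes occurring inside binary image data, which is one reason the parser-based approach from the other answer is safer.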

Any comments on this solution are welcome!