How to replace/remove text in a PDF file

Time: 2018-03-26 11:18:28

Tags: pdf pdf-generation

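How do I replace/remove text in a PDF file?

I have a PDF file that I obtained from somewhere, and I would like to be able to replace some of the text in it.

Or, I have a PDF file and I would like to obscure (redact) some of the text in it so that it is no longer visible [and so that it looks cool, like the CIA files].

Or, I have a PDF that contains global JavaScript, and I want to stop it from interfering with my use of the PDF.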

2 answers:

Answer 0: (score: 2)

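This is possible in a limited fashion with iText / iTextSharp. It only works for the Tj / TJ operators (i.e. standard text, not text embedded in images or drawn with shapes).

You need to override the default PdfContentStreamProcessor so that it acts on the page content streams, as shown by Mkl in "Removing Watermark from PDF iTextSharp". Inherit from this class, and in your new class look for the Tj / TJ operators. Their operands are generally the text elements (for TJ this may not be straightforward text, and may require further parsing of all the operands).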
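
To make the TJ caveat concrete: a TJ operand is an array that mixes string pieces with numeric kerning adjustments, so a single word can be split across several strings. The sketch below only illustrates that idea on a raw operand string. `TjArrayDemo` and `ExtractText` are hypothetical names, not part of iTextSharp, and the parsing is deliberately simplified (no hex strings, no octal escapes, no font encoding).

```csharp
using System.Text;

public static class TjArrayDemo
{
    // Concatenates the literal strings "( ... )" found in a TJ array operand,
    // skipping the numeric kerning adjustments between them.
    // Simplification (assumption): only the escapes \( \) and \\ are handled;
    // other escapes, hex strings <...> and font encodings are not decoded.
    public static string ExtractText(string tjOperand)
    {
        var text = new StringBuilder();
        var depth = 0; // nesting level of parentheses; 0 means outside any string

        for (var i = 0; i < tjOperand.Length; i++)
        {
            var c = tjOperand[i];
            if (c == '\\' && depth > 0 && i + 1 < tjOperand.Length)
            {
                text.Append(tjOperand[++i]); // keep the escaped character as-is
            }
            else if (c == '(')
            {
                if (depth++ > 0) text.Append(c); // nested parenthesis belongs to the string
            }
            else if (c == ')' && depth > 0)
            {
                if (--depth > 0) text.Append(c);
            }
            else if (depth > 0)
            {
                text.Append(c); // ordinary character inside a string
            }
        }

        return text.ToString();
    }
}
```

For the operand `[(Hel) -20 (lo)] TJ` this yields `Hello`, even though neither string piece contains the whole word. A pattern tested against each piece separately would therefore miss it.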

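A pretty basic example of some of the flexibility of iTextSharp is available from this GitHub repository: https://github.com/bevanweiss/PdfEditor (code excerpts follow).

Note: this uses the AGPL version of iTextSharp (and so is also AGPL). If you distribute executables derived from this code, or allow others to interact with those executables in any way, you must also provide your modified source code. There is also no warranty, implied or expressed, for this code. Use at your own risk.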

PdfContentStreamEditor

using System.Collections.Generic;

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFCleaner
{
    public class PdfContentStreamEditor : PdfContentStreamProcessor
    {
        /**
         * This method edits the immediate contents of a page, i.e. its content stream.
         * It explicitly does not descend into form xobjects, patterns, or annotations.
         */
        public void EditPage(PdfStamper pdfStamper, int pageNum)
        {
            var pdfReader = pdfStamper.Reader;
            var page = pdfReader.GetPageN(pageNum);
            var pageContentInput = ContentByteUtils.GetContentBytesForPage(pdfReader, pageNum);
            page.Remove(PdfName.CONTENTS);
            EditContent(pageContentInput, page.GetAsDict(PdfName.RESOURCES), pdfStamper.GetUnderContent(pageNum));
        }

        /**
         * This method processes the content bytes and outputs to the given canvas.
         * It explicitly does not descend into form xobjects, patterns, or annotations.
         */
        public virtual void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
        {
            this.Canvas = canvas;
            ProcessContent(contentBytes, resources);
            this.Canvas = null;
        }

        /**
         * This method writes content stream operations to the target canvas. The default
         * implementation writes them as they come, so it essentially generates identical
         * copies of the original instructions the {@link ContentOperatorWrapper} instances
         * forward to it.
         *
         * Override this method to achieve some fancy editing effect.
         */

        protected virtual void Write(PdfContentStreamProcessor processor, PdfLiteral operatorLit, List<PdfObject> operands)
        {
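            var index = 0;

            // The operand list ends with the operator literal itself, so writing every
            // entry reproduces the complete instruction: operands separated by spaces,
            // with a newline after the operator.
            foreach (var pdfObject in operands)
            {
                pdfObject.ToPdf(null, Canvas.InternalBuffer);
                Canvas.InternalBuffer.Append(operands.Count > ++index ? (byte) ' ' : (byte) '\n');
            }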
        }


        //
        // constructor giving the parent a dummy listener to talk to 
        //
        public PdfContentStreamEditor() : base(new DummyRenderListener())
        {
        }

        //
        // constructor letting the caller supply the render listener
        //
        public PdfContentStreamEditor(IRenderListener renderListener) : base(renderListener)
        {
        }

        //
        // Overrides of PdfContentStreamProcessor methods
        //

        public override IContentOperator RegisterContentOperator(string operatorString, IContentOperator newOperator)
        {
            var wrapper = new ContentOperatorWrapper();
            wrapper.SetOriginalOperator(newOperator);
            var formerOperator = base.RegisterContentOperator(operatorString, wrapper);
            return (formerOperator is ContentOperatorWrapper operatorWrapper ? operatorWrapper.GetOriginalOperator() : formerOperator);
        }

        public override void ProcessContent(byte[] contentBytes, PdfDictionary resources)
        {
            this.Resources = resources; 
            base.ProcessContent(contentBytes, resources);
            this.Resources = null;
        }

        //
        // members holding the output canvas and the resources
        //
        protected PdfContentByte Canvas = null;
        protected PdfDictionary Resources = null;

        //
        // A content operator class to wrap all content operators to forward the invocation to the editor
        //
        class ContentOperatorWrapper : IContentOperator
        {
            public IContentOperator GetOriginalOperator()
            {
                return _originalOperator;
            }

            public void SetOriginalOperator(IContentOperator op)
            {
                this._originalOperator = op;
            }

            public void Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
            {
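                // "Do" paints an XObject; skip the wrapped handler so the processor does not
                // descend into form XObjects, but still copy the instruction to the output.
                if (_originalOperator != null && !"Do".Equals(oper.ToString()))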
                {
                    _originalOperator.Invoke(processor, oper, operands);
                }
                ((PdfContentStreamEditor)processor).Write(processor, oper, operands);
            }

            private IContentOperator _originalOperator = null;
        }

        //
        // A dummy render listener to give to the underlying content stream processor to feed events to
        //
        class DummyRenderListener : IRenderListener
        {
            public void BeginTextBlock() { }

            public void RenderText(TextRenderInfo renderInfo) { }

            public void EndTextBlock() { }

            public void RenderImage(ImageRenderInfo renderInfo) { }
        }
    }
}

TextReplaceStreamEditor

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFCleaner
{
    public class TextReplaceStreamEditor : PdfContentStreamEditor
    {
        public TextReplaceStreamEditor(string MatchPattern, string ReplacePattern)
        {
            _matchPattern = MatchPattern;
            _replacePattern = ReplacePattern;
        }

        private string _matchPattern;
        private string _replacePattern;

        protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
        {
            var operatorString = oper.ToString();
            if ("Tj".Equals(operatorString) || "TJ".Equals(operatorString))
            {
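                // Note: for TJ the text strings sit inside a single PdfArray operand, so this
                // loop, which only inspects top-level string operands, effectively handles Tj.
                // Replacing text shown with TJ would require walking that array as well.
                for(var i = 0; i < operands.Count; i++)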
                {
                    if(!operands[i].IsString())
                        continue;

                    var text = operands[i].ToString();
                    if(Regex.IsMatch(text, _matchPattern))
                    {
                        operands[i] = new PdfString(Regex.Replace(text, _matchPattern, _replacePattern));
                    }
                }
            }

            base.Write(processor, oper, operands);
        }
    }
}

TextRedactStreamEditor

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFCleaner
{
    public class TextRedactStreamEditor : PdfContentStreamEditor
    {
        public TextRedactStreamEditor(string MatchPattern) : base(new RedactRenderListener(MatchPattern))
        {
            _matchPattern = MatchPattern;
        }

        private string _matchPattern;

        protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
        {
            base.Write(processor, oper, operands);
        }

        public override void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
        {
            ((RedactRenderListener)base.RenderListener).SetCanvas(canvas);
            base.EditContent(contentBytes, resources, canvas);
        }
    }

    //
    // A pretty simple render listener; all we care about is text.
    // We listen for text chunks, look for our pattern, and then draw a
    // black box over the match: text 'redacted'.
    //
    class RedactRenderListener : IRenderListener
    {
        private PdfContentByte _canvas;
        private string _matchPattern;

        public RedactRenderListener(string MatchPattern)
        {
            _matchPattern = MatchPattern;
        }

        public RedactRenderListener(PdfContentByte Canvas, string MatchPattern)
        {
            _canvas = Canvas;
            _matchPattern = MatchPattern;
        }

        public void SetCanvas(PdfContentByte Canvas)
        {
            _canvas = Canvas;
        }

        public void BeginTextBlock() { }

        public void RenderText(TextRenderInfo renderInfo)
        {
            var text = renderInfo.GetText();

            var match = Regex.Match(text, _matchPattern);
            if(match.Success && match.Length > 0)
            {
                // The last matched character is at Index + Length - 1; indexing at
                // Index + Length would overrun the list when the match ends the chunk.
                var chars = renderInfo.GetCharacterRenderInfos();
                var first = chars[match.Index];
                var last = chars[match.Index + match.Length - 1];
                var p1 = first.GetAscentLine().GetStartPoint();
                var p2 = last.GetAscentLine().GetEndPoint();
                var p3 = last.GetDescentLine().GetEndPoint();
                var p4 = first.GetDescentLine().GetStartPoint();

                _canvas.SaveState();
                _canvas.SetColorStroke(BaseColor.BLACK);
                _canvas.SetColorFill(BaseColor.BLACK);
                _canvas.MoveTo(p1[Vector.I1], p1[Vector.I2]);
                _canvas.LineTo(p2[Vector.I1], p2[Vector.I2]);
                _canvas.LineTo(p3[Vector.I1], p3[Vector.I2]);
                _canvas.LineTo(p4[Vector.I1], p4[Vector.I2]);
                _canvas.ClosePathFillStroke();
                _canvas.RestoreState();
            }
        }

        public void EndTextBlock() { }

        public void RenderImage(ImageRenderInfo renderInfo) { }
    }
}
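
The corners of the redaction box are taken from the render info of the first and last matched characters, so the index arithmetic on the regex match matters. Here is a minimal sketch of that arithmetic, using only System.Text.RegularExpressions. `MatchSpanDemo` and `CharacterSpan` are hypothetical names for illustration, not part of iTextSharp or the repository.

```csharp
using System.Text.RegularExpressions;

public static class MatchSpanDemo
{
    // Returns the indexes of the first and last characters covered by the
    // first match, or null when the pattern does not occur (or matches nothing).
    public static (int First, int Last)? CharacterSpan(string text, string pattern)
    {
        var match = Regex.Match(text, pattern);
        if (!match.Success || match.Length == 0)
            return null;

        // Index + Length points one past the match, so the last
        // matched character sits at Index + Length - 1.
        return (match.Index, match.Index + match.Length - 1);
    }
}
```

For the text "Top secret" and the pattern "secret", the span runs from index 4 to index 9, which is also the last valid index of the ten-character string.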

Using them with iTextSharp

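using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using PDFCleaner;

var reader = new PdfReader("SRC FILE PATH GOES HERE");
var dstFile = File.Open("DST FILE PATH GOES HERE", FileMode.Create);

// The stamper writes to the destination stream opened above.
var pdfStamper = new PdfStamper(reader, dstFile, reader.PdfVersion, false);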

// We don't need to auto-rotate, as the PdfContentStreamEditor will already deal with pre-rotated space..
// if we enable this we will inadvertently rotate the content.
pdfStamper.RotateContents = false;

// This is for the Text Replace
var replaceTextProcessor = new TextReplaceStreamEditor(
    "TEXT TO REPLACE HERE",
    "TEXT TO SUBSTITUTE IN HERE");

for(int i=1; i <= reader.NumberOfPages; i++)
    replaceTextProcessor.EditPage(pdfStamper, i);


// This is for the Text Redact
var redactTextProcessor = new TextRedactStreamEditor(
    "TEXT TO REDACT HERE");
for(int i=1; i <= reader.NumberOfPages; i++)
    redactTextProcessor.EditPage(pdfStamper, i);
// Since our redaction just puts a box over the top, we should secure the document a bit,
// to prevent people from copying/pasting the text behind the box. This also prevents
// text-to-speech processing of the file; otherwise the 'hidden' text would be spoken.
pdfStamper.Writer.SetEncryption(null, 
    Encoding.UTF8.GetBytes("ownerPassword"),
    PdfWriter.AllowDegradedPrinting | PdfWriter.AllowPrinting,
    PdfWriter.ENCRYPTION_AES_256);

// hey, let's get rid of JavaScript too, because it's annoying
pdfStamper.Javascript = "";


// and then finally we close our files (saving the output in the process)
pdfStamper.Close();
reader.Close();

Answer 1: (score: 0)

You can use GroupDocs.Redaction (for .NET) to replace or remove text in PDF documents. It supports exact-phrase, case-sensitive, and regular-expression redaction of text. For example, it can replace "candy" with "[redacted]" in a loaded PDF document.

Disclosure: I am a developer evangelist at GroupDocs.