Question

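I have a PDF file that I am reading into a string using ITextExtractionStrategy. From that string I take a substring such as "My name is XYZ", and I need to get the rectangular coordinates of that substring in the PDF file, but I am not able to do it. From googling I learned about LocationTextExtractionStrategy, but I could not work out how to use it to get the coordinates.

Here is the code: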

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);

string getcoordinate="My name is XYZ";

How can I get the rectangular coordinates of this substring using iTextSharp?

Please help.

Answer 1

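Here's a very, very simple version of an implementation.

Before implementing it, it is very important to know that PDFs have zero concept of "words", "paragraphs", "sentences", etc. Also, text within a PDF is not necessarily laid out left to right and top to bottom, and this has nothing to do with non-LTR languages. The phrase "Hello World" could be written into the PDF as: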

Draw H at (10, 10)
Draw ell at (20, 10)
Draw rld at (90, 10)
Draw o Wo at (50, 20)

It could also be written as:

Draw Hello World at (10,10)

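The ITextExtractionStrategy interface that you need to implement has a method called RenderText that gets called once for every chunk of text within a PDF. Notice I said "chunk" and not "word". In the first example above, the method would be called four times for those two words. In the second example, it would be called once for those two words. This is the very important part to understand. PDFs don't have words, and because of this, iTextSharp doesn't have words either. The "word" part is 100% up to you to solve.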
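To make the chunk idea concrete, here is a small sketch that does not use iTextSharp at all. The Chunk class and the coordinates are made up for illustration (all four chunks are assumed to sit on the same y value and are drawn out of reading order). It only shows that drawing order says nothing about reading order, so the chunks have to be sorted by position before they look like words:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one text-drawing command: the text plus where it starts.
public class Chunk
{
    public string Text;
    public float X;
    public float Y;
    public Chunk(string text, float x, float y) { Text = text; X = x; Y = y; }
}

public static class ChunkDemo
{
    // Rebuild reading order: higher lines first (PDF y grows upward), then left to right.
    public static string Join(IEnumerable<Chunk> chunks)
    {
        return string.Concat(chunks
            .OrderByDescending(c => c.Y)
            .ThenBy(c => c.X)
            .Select(c => c.Text));
    }

    public static void Main()
    {
        // Drawn out of order, the way a PDF producer is free to do.
        var drawn = new List<Chunk>
        {
            new Chunk("H", 10, 10),
            new Chunk("ell", 20, 10),
            new Chunk("rld", 90, 10),
            new Chunk("o Wo", 50, 10),
        };

        // Text in the order the drawing commands were issued
        Console.WriteLine(string.Concat(drawn.Select(c => c.Text)));
        // Text after sorting the chunks by position
        Console.WriteLine(Join(drawn));
    }
}
```

A real strategy also has to decide when two chunks are "on the same line" and whether a gap between them counts as a space, which this sketch ignores.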

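Also along these lines, as I said above, PDFs don't have paragraphs. The reason to be aware of this is that PDFs cannot wrap text to a new line. Any time you see something that looks like a paragraph return, you are actually seeing a brand new text-drawing command that has a different y coordinate from the previous line. See this for further discussion.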

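The code below is a very simple implementation. For it, I'm subclassing LocationTextExtractionStrategy, which already implements ITextExtractionStrategy. On each call to RenderText() I find the rectangle of the current chunk (using Mark's code here) and store it for later. I'm using this simple helper class to store these chunks and rectangles: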

//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

And here's the subclass:

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

Finally, here is an example that uses the classes above:

//Our test file
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");

//Create our test file, nothing special
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
    using (var doc = new Document()) {
        using (var writer = PdfWriter.GetInstance(doc, fs)) {
            doc.Open();
            doc.Add(new Paragraph("This is my sample file"));
            doc.Close();
        }
    }
}

//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}

I can't stress enough that the above does not take "words" into account; that will be up to you. The TextRenderInfo object that gets passed into RenderText has a method called GetCharacterRenderInfos() that you might be able to use to get more information. You might also want to use GetBaseline() instead of GetDescentLine() if you don't care about descenders in the font.

EDIT

(I had a great lunch, so I'm feeling a little more helpful.)

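Here's an updated version of MyLocationTextExtractionStrategy that does what my comments below say, namely it takes a string to search for and searches each chunk for that string. For all the reasons listed, this will not work in some/many/most/all cases. If the substring exists multiple times in a single chunk, it will also only return the first instance. Ligatures and diacritics could also mess with this.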
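A minimal sketch of what such a chunk-level search might look like is given here. It is a reconstruction under assumptions, not necessarily the exact code of the answer: it reuses the RectAndText helper class from above, compares with a plain ordinal IndexOf, and assumes that GetCharacterRenderInfos() yields one entry per character of GetText() (which ligatures can break). The property name TextToSearchFor is chosen for this sketch only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf.parser;

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold the rectangle of each match
    public List<RectAndText> myPoints = new List<RectAndText>();

    //The string that we're searching for (name chosen for this sketch)
    public String TextToSearchFor { get; private set; }

    public MyLocationTextExtractionStrategy(String textToSearchFor) {
        this.TextToSearchFor = textToSearchFor;
    }

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //See if the current chunk contains the text; only the first hit is used
        var start = renderInfo.GetText().IndexOf(this.TextToSearchFor, StringComparison.Ordinal);
        if (start < 0) {
            return;
        }

        //Grab the per-character render infos for just the matched part
        var chars = renderInfo.GetCharacterRenderInfos()
                              .Skip(start)
                              .Take(this.TextToSearchFor.Length)
                              .ToList();
        if (chars.Count == 0) {
            return;
        }

        //Bounding box from the first to the last matched character
        var bottomLeft = chars.First().GetDescentLine().GetStartPoint();
        var topRight = chars.Last().GetAscentLine().GetEndPoint();

        var rect = new iTextSharp.text.Rectangle(
            bottomLeft[Vector.I1],
            bottomLeft[Vector.I2],
            topRight[Vector.I1],
            topRight[Vector.I2]);

        //Add this match to our main collection
        this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
    }
}
```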

You would use this the same as before, but now the constructor has a single required parameter: the string to search for.

Answer 2

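This is an old question, but I'm leaving my response here because I could not find a correct answer on the web.

As Chris Haas has pointed out, dealing with words is not easy because iText works with chunks. The code Chris posted failed in most of my tests because a word is usually split across different chunks (he warns about this in his post).

To solve that problem, here is the strategy I used:

1. Split the chunks into characters (actually one TextRenderInfo object per character).
2. Group the characters by line. This is not straightforward, because you have to deal with chunk alignment.
3. Search each line for the text you need to find.

I'm leaving the code here. I tested it with several documents and it works pretty well, but it could fail in some scenarios because this chunk -> word transformation is a bit tricky.

Hope it helps someone.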

class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy
{
    private List<LocationTextExtractionStrategyEx.ExtendedTextChunk> m_DocChunks = new List<ExtendedTextChunk>();
    private List<LocationTextExtractionStrategyEx.LineInfo> m_LinesTextInfo = new List<LineInfo>();
    public List<SearchResult> m_SearchResultsList = new List<SearchResult>();
    private String m_SearchText;
    public const float PDF_PX_TO_MM = 0.3528f;
    public float m_PageSizeY;


    public LocationTextExtractionStrategyEx(String sSearchText, float fPageSizeY)
        : base()
    {
        this.m_SearchText = sSearchText;
        this.m_PageSizeY = fPageSizeY;
    }

    private void searchText()
    {
        foreach (LineInfo aLineInfo in m_LinesTextInfo)
        {
            int iIndex = aLineInfo.m_Text.IndexOf(m_SearchText);
            if (iIndex != -1)
            {
                TextRenderInfo aFirstLetter = aLineInfo.m_LineCharsList.ElementAt(iIndex);
                SearchResult aSearchResult = new SearchResult(aFirstLetter, m_PageSizeY);
                this.m_SearchResultsList.Add(aSearchResult);
            }
        }
    }

    private void groupChunksbyLine()
    {                     
        LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk1 = null;
        LocationTextExtractionStrategyEx.LineInfo textInfo = null;
        foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk2 in this.m_DocChunks)
        {
            if (textChunk1 == null)
            {                    
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            else if (textChunk2.sameLine(textChunk1))
            {                      
                textInfo.appendText(textChunk2);
            }
            else
            {                                        
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            textChunk1 = textChunk2;
        }
    }

    public override string GetResultantText()
    {
        groupChunksbyLine();
        searchText();
        //In this case the return value is not useful
        return "";
    }

    public override void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment baseline = renderInfo.GetBaseline();
        //Create ExtendedChunk
        ExtendedTextChunk aExtendedChunk = new ExtendedTextChunk(renderInfo.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetCharacterRenderInfos().ToList());
        this.m_DocChunks.Add(aExtendedChunk);
    }

    public class ExtendedTextChunk
    {
        public string m_text;
        private Vector m_startLocation;
        private Vector m_endLocation;
        private Vector m_orientationVector;
        private int m_orientationMagnitude;
        private int m_distPerpendicular;           
        private float m_charSpaceWidth;           
        public List<TextRenderInfo> m_ChunkChars;


        public ExtendedTextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth,List<TextRenderInfo> chunkChars)
        {
            this.m_text = txt;
            this.m_startLocation = startLoc;
            this.m_endLocation = endLoc;
            this.m_charSpaceWidth = charSpaceWidth;                
            this.m_orientationVector = this.m_endLocation.Subtract(this.m_startLocation).Normalize();
            this.m_orientationMagnitude = (int)(Math.Atan2((double)this.m_orientationVector[1], (double)this.m_orientationVector[0]) * 1000.0);
            this.m_distPerpendicular = (int)this.m_startLocation.Subtract(new Vector(0.0f, 0.0f, 1f)).Cross(this.m_orientationVector)[2];                
            this.m_ChunkChars = chunkChars;

        }


        public bool sameLine(LocationTextExtractionStrategyEx.ExtendedTextChunk textChunkToCompare)
        {
            return this.m_orientationMagnitude == textChunkToCompare.m_orientationMagnitude && this.m_distPerpendicular == textChunkToCompare.m_distPerpendicular;
        }


    }

    public class SearchResult
    {
        public int iPosX;
        public int iPosY;

        public SearchResult(TextRenderInfo aCharcter, float fPageSizeY)
        {
            //Get position of upperLeft coordinate
            Vector vTopLeft = aCharcter.GetAscentLine().GetStartPoint();
            //PosX
            float fPosX = vTopLeft[Vector.I1]; 
            //PosY
            float fPosY = vTopLeft[Vector.I2];
            //Transform to mm and get y from top of page
            iPosX = Convert.ToInt32(fPosX * PDF_PX_TO_MM);
            iPosY = Convert.ToInt32((fPageSizeY - fPosY) * PDF_PX_TO_MM);
        }
    }

    public class LineInfo
    {            
        public string m_Text;
        public List<TextRenderInfo> m_LineCharsList;

        public LineInfo(LocationTextExtractionStrategyEx.ExtendedTextChunk initialTextChunk)
        {                
            this.m_Text = initialTextChunk.m_text;
            this.m_LineCharsList = initialTextChunk.m_ChunkChars;
        }

        public void appendText(LocationTextExtractionStrategyEx.ExtendedTextChunk additionalTextChunk)
        {
            m_LineCharsList.AddRange(additionalTextChunk.m_ChunkChars);
            this.m_Text += additionalTextChunk.m_text;
        }
    }
}
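A note on the numbers in SearchResult: by default PDF coordinates are in points (1/72 inch), with y measured upward from the bottom of the page, so 0.3528 is 25.4 / 72 rounded, and subtracting from the page height flips y so that it is measured from the top. Below is a standalone sketch of the same arithmetic. The PdfUnits class and its method name are made up for illustration, and it assumes the lower-left corner of the page is at (0, 0):

```csharp
using System;

public static class PdfUnits
{
    // One point is 1/72 inch and one inch is 25.4 mm.
    public const float PointsToMm = 25.4f / 72f;

    // Convert a position given in PDF space (y measured up from the bottom edge)
    // to whole millimetres measured from the top-left corner of the page.
    public static Tuple<int, int> ToTopLeftMm(float xPt, float yPt, float pageHeightPt)
    {
        int xMm = Convert.ToInt32(xPt * PointsToMm);
        int yMm = Convert.ToInt32((pageHeightPt - yPt) * PointsToMm);
        return Tuple.Create(xMm, yMm);
    }

    public static void Main()
    {
        // One inch in from the left and one inch down from the top of an A4 page (842 pt high).
        var mm = ToTopLeftMm(72f, 842f - 72f, 842f);
        Console.WriteLine(mm.Item1 + ", " + mm.Item2);
    }
}
```

With the class above, fPageSizeY would typically be the page height in points, for example from reader.GetPageSize(page).Height, and the matches end up in m_SearchResultsList once GetResultantText() has run.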

Answer 3

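I know this is a really old question, but below is what I ended up doing. I'm posting it here in the hope that it will be useful to someone else.

The following code will tell you the starting coordinates of the line(s) that contain the search text. It should not be hard to modify it to give the positions of words. Note: I tested this on iTextSharp 5.5.11.0, and it will not work on some older versions.

As mentioned above, PDFs have no concept of words, lines, or paragraphs. However, I found that LocationTextExtractionStrategy does a very good job of splitting lines and words, so my solution is based on that.

Disclaimer:

This solution is based on https://github.com/itext/itextsharp/blob/develop/src/core/iTextSharp/text/pdf/parser/LocationTextExtractionStrategy.cs, and that file has a comment saying that it is a development preview. So this might not work in the future.

Anyway, here is the code.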

using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

namespace Logic
{
    public class LocationTextExtractionStrategyWithPosition : LocationTextExtractionStrategy
    {
        private readonly List<TextChunk> locationalResult = new List<TextChunk>();

        private readonly ITextChunkLocationStrategy tclStrat;

        public LocationTextExtractionStrategyWithPosition() : this(new TextChunkLocationStrategyDefaultImp()) {
        }

        /**
         * Creates a new text extraction renderer, with a custom strategy for
         * creating new TextChunkLocation objects based on the input of the
         * TextRenderInfo.
         * @param strat the custom strategy
         */
        public LocationTextExtractionStrategyWithPosition(ITextChunkLocationStrategy strat)
        {
            tclStrat = strat;
        }


        private bool StartsWithSpace(string str)
        {
            if (str.Length == 0) return false;
            return str[0] == ' ';
        }


        private bool EndsWithSpace(string str)
        {
            if (str.Length == 0) return false;
            return str[str.Length - 1] == ' ';
        }

        /**
         * Filters the provided list with the provided filter
         * @param textChunks a list of all TextChunks that this strategy found during processing
         * @param filter the filter to apply.  If null, filtering will be skipped.
         * @return the filtered list
         * @since 5.3.3
         */

        private List<TextChunk> filterTextChunks(List<TextChunk> textChunks, ITextChunkFilter filter)
        {
            if (filter == null)
            {
                return textChunks;
            }

            var filtered = new List<TextChunk>();

            foreach (var textChunk in textChunks)
            {
                if (filter.Accept(textChunk))
                {
                    filtered.Add(textChunk);
                }
            }

            return filtered;
        }

        public override void RenderText(TextRenderInfo renderInfo)
        {
            LineSegment segment = renderInfo.GetBaseline();
            if (renderInfo.GetRise() != 0)
            { // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to 
                Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
                segment = segment.TransformBy(riseOffsetTransform);
            }
            TextChunk tc = new TextChunk(renderInfo.GetText(), tclStrat.CreateLocation(renderInfo, segment));
            locationalResult.Add(tc);
        }


        public IList<TextLocation> GetLocations()
        {

            var filteredTextChunks = filterTextChunks(locationalResult, null);
            filteredTextChunks.Sort();

            TextChunk lastChunk = null;

            var textLocations = new List<TextLocation>();

            foreach (var chunk in filteredTextChunks)
            {

                if (lastChunk == null)
                {
                    //initial
                    textLocations.Add(new TextLocation
                    {
                        Text = chunk.Text,
                        X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
                        Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
                    });

                }
                else
                {
                    if (chunk.SameLine(lastChunk))
                    {
                        var text = "";
                        // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                        if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text))
                            text += ' ';

                        text += chunk.Text;

                        textLocations[textLocations.Count - 1].Text += text;

                    }
                    else
                    {

                        textLocations.Add(new TextLocation
                        {
                            Text = chunk.Text,
                            X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
                            Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
                        });
                    }
                }
                lastChunk = chunk;
            }

            //now find the location(s) with the given texts
            return textLocations;

        }

    }

    public class TextLocation
    {
        public float X { get; set; }
        public float Y { get; set; }

        public string Text { get; set; }
    }
}

How to call the method:

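        // Requires: using System.Linq; (for Where/OrderBy/ToList)
        // Declared outside the using block so it is still in scope afterwards
        IList<TextLocation> res;

        using (var reader = new PdfReader(inputPdf))
        {
            var parser = new PdfReaderContentParser(reader);

            var strategy = parser.ProcessContent(pageNumber, new LocationTextExtractionStrategyWithPosition());

            res = strategy.GetLocations();
        }

        // Lines containing the search text, highest Y first (top of the page first)
        var searchResult = res.Where(p => p.Text.Contains(searchText)).OrderBy(p => p.Y).Reverse().ToList();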




inputPdf is a byte[] that holds the PDF data.

pageNumber is the page you want to search in.

Answer 4

Here is how to use LocationTextExtractionStrategy in VB.NET.

Class definition:

Class TextExtractor
    Inherits LocationTextExtractionStrategy
    Implements iTextSharp.text.pdf.parser.ITextExtractionStrategy
    Public oPoints As IList(Of RectAndText) = New List(Of RectAndText)
    Public Overrides Sub RenderText(renderInfo As TextRenderInfo) 'Implements IRenderListener.RenderText
        MyBase.RenderText(renderInfo)

        Dim bottomLeft As Vector = renderInfo.GetDescentLine().GetStartPoint()
        Dim topRight As Vector = renderInfo.GetAscentLine().GetEndPoint() 'GetBaseline

        Dim rect As Rectangle = New Rectangle(bottomLeft(Vector.I1), bottomLeft(Vector.I2), topRight(Vector.I1), topRight(Vector.I2))
        oPoints.Add(New RectAndText(rect, renderInfo.GetText()))
    End Sub

    Private Function GetLines() As Dictionary(Of Single, ArrayList)
        Dim oLines As New Dictionary(Of Single, ArrayList)
        For Each p As RectAndText In oPoints
            Dim iBottom = p.Rect.Bottom

            If oLines.ContainsKey(iBottom) = False Then
                oLines(iBottom) = New ArrayList()
            End If

            oLines(iBottom).Add(p)
        Next

        Return oLines
    End Function

    Public Function Find(ByVal sFind As String) As iTextSharp.text.Rectangle
        Dim oLines As Dictionary(Of Single, ArrayList) = GetLines()

        For Each oEntry As KeyValuePair(Of Single, ArrayList) In oLines
            'Dim iBottom As Integer = oEntry.Key
            Dim oRectAndTexts As ArrayList = oEntry.Value
            Dim sLine As String = ""
            For Each p As RectAndText In oRectAndTexts
                sLine += p.Text
                If sLine.IndexOf(sFind) <> -1 Then
                    Return p.Rect
                End If
            Next
        Next

        Return Nothing
    End Function

End Class

Public Class RectAndText
    Public Rect As iTextSharp.text.Rectangle
    Public Text As String
    Public Sub New(ByVal rect As iTextSharp.text.Rectangle, ByVal text As String)
        Me.Rect = rect
        Me.Text = text
    End Sub
End Class

Usage (inserts a signature box to the right of the found text):

Sub EncryptPdf(ByVal sInFilePath As String, ByVal sOutFilePath As String)

        Dim oPdfReader As iTextSharp.text.pdf.PdfReader = New iTextSharp.text.pdf.PdfReader(sInFilePath)
        Dim oPdfDoc As New iTextSharp.text.Document()
        Dim oPdfWriter As PdfWriter = PdfWriter.GetInstance(oPdfDoc, New FileStream(sOutFilePath, FileMode.Create))
        'oPdfWriter.SetEncryption(PdfWriter.STRENGTH40BITS, sPassword, sPassword, PdfWriter.AllowCopy)
        oPdfDoc.Open()

        oPdfDoc.SetPageSize(iTextSharp.text.PageSize.LEDGER.Rotate())

        Dim oDirectContent As iTextSharp.text.pdf.PdfContentByte = oPdfWriter.DirectContent
        Dim iNumberOfPages As Integer = oPdfReader.NumberOfPages
        Dim iPage As Integer = 0

        Dim iBottomMargin As Integer = txtBottomMargin.Text '10
        Dim iLeftMargin As Integer = txtLeftMargin.Text '500
        Dim iWidth As Integer = txtWidth.Text '120
        Dim iHeight As Integer = txtHeight.Text '780

        Dim oStrategy As New parser.SimpleTextExtractionStrategy()


        Do While (iPage < iNumberOfPages)
            iPage += 1
            oPdfDoc.SetPageSize(oPdfReader.GetPageSizeWithRotation(iPage))
            oPdfDoc.NewPage()

            Dim oPdfImportedPage As iTextSharp.text.pdf.PdfImportedPage =
            oPdfWriter.GetImportedPage(oPdfReader, iPage)
            Dim iRotation As Integer = oPdfReader.GetPageRotation(iPage)
            If (iRotation = 90) Or (iRotation = 270) Then
                oDirectContent.AddTemplate(oPdfImportedPage, 0, -1.0F, 1.0F,
                 0, 0, oPdfReader.GetPageSizeWithRotation(iPage).Height)
            Else
                oDirectContent.AddTemplate(oPdfImportedPage, 1.0F, 0, 0, 1.0F, 0, 0)
            End If

            'Dim sPageText As String = parser.PdfTextExtractor.GetTextFromPage(oPdfReader, iPage, oStrategy)
            'sPageText = System.Text.Encoding.UTF8.GetString(System.Text.ASCIIEncoding.Convert(System.Text.Encoding.Default, System.Text.Encoding.UTF8, System.Text.Encoding.Default.GetBytes(sPageText)))
            'If txtFind.Text = "" OrElse sPageText.IndexOf(txtFind.Text) <> -1 Then

            Dim oTextExtractor As New TextExtractor()
            PdfTextExtractor.GetTextFromPage(oPdfReader, iPage, oTextExtractor) 'Initialize oTextExtractor

            Dim oRect As iTextSharp.text.Rectangle = oTextExtractor.Find(txtFind.Text)
            If oRect IsNot Nothing Then
                Dim iX As Integer = oRect.Left + oRect.Width + iLeftMargin 'Move right
                Dim iY As Integer = oRect.Bottom - iBottomMargin 'Move down

                Dim field As PdfFormField = PdfFormField.CreateSignature(oPdfWriter)
                field.SetWidget(New Rectangle(iX, iY, iX + iWidth, iY + iHeight), PdfAnnotation.HIGHLIGHT_OUTLINE)
                field.FieldName = "myEmptySignatureField" & iPage
                oPdfWriter.AddAnnotation(field)
            End If

        Loop

        oPdfDoc.Close()

    End Sub

Getting the coordinates of a string using ITextExtractionStrategy and LocationTextExtractionStrategy in iTextSharp

4 answers:
