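I have a PDF file that I am reading into a string using ITextExtractionStrategy. From that string I take a substring such as My name is XYZ, and I need the rectangular coordinates of that substring in the PDF file, but I have not been able to get them. While searching I learned about LocationTextExtractionStrategy, but I could not work out how to use it to get the coordinates.
Here is the code: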
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
string getcoordinate="My name is XYZ";
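How can I get the rectangular coordinates of this substring using iTextSharp?
Please help.
Answer 0 (score: 33)
Here is a very, very simple version of an implementation.
Before implementing it, it is very important to know that PDFs have zero concept of "words", "paragraphs", "sentences", etc. Also, text within a PDF is not necessarily laid out left to right and top to bottom, and this has nothing to do with non-LTR languages. The phrase "Hello World" could be written into the PDF as: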
Draw H at (10, 10)
Draw ell at (20, 10)
Draw rld at (90, 10)
Draw o Wo at (50, 20)
It could also be written as:
Draw Hello World at (10,10)
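The ITextExtractionStrategy interface that you need to implement has a method called RenderText that gets called once for every chunk of text within a PDF. Notice I said "chunk" and not "word". In the first example above the method would be called four times for those two words. In the second example it would be called once for those two words. This is the very important part to understand. PDFs don't have words and because of this, iTextSharp doesn't have words either. The "word" part is 100% up to you to solve.
Also along these lines, as I said above, PDFs don't have paragraphs. The reason to be aware of this is that PDFs cannot wrap text to a new line. Any time you see something that looks like a paragraph, you are actually seeing a brand new text drawing command that has a different y coordinate than the previous line. See this for further discussion.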
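To make the "chunk" idea concrete, here is a minimal sketch that does not use iTextSharp at all. It assumes each drawing command has been reduced to its text plus the x/y of its start point (the `DrawnChunk` and `ReadingOrder` names are hypothetical), that chunks on one line share exactly the same y value, and that the page uses the default PDF coordinate system with the origin at the bottom-left. Real documents often break all three assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one text-drawing command: the text plus its start point.
public class DrawnChunk
{
    public string Text;
    public float X;
    public float Y;
    public DrawnChunk(string text, float x, float y) { Text = text; X = x; Y = y; }
}

public static class ReadingOrder
{
    // Chunks with the same y form a line. Lines are ordered top to bottom
    // (larger y first), and chunks within a line are ordered left to right.
    public static List<string> ToLines(IEnumerable<DrawnChunk> chunks)
    {
        return chunks
            .GroupBy(c => c.Y)
            .OrderByDescending(g => g.Key)
            .Select(g => string.Concat(g.OrderBy(c => c.X).Select(c => c.Text)))
            .ToList();
    }
}
```

Feeding it four chunks drawn out of order on one baseline, for example "H" at x=10, "ell" at x=20, "rld" at x=90 and "o Wo" at x=50, yields a single line whose text is the chunks joined by x position. Anything beyond that (word boundaries, slightly different baselines, rotated text) is still up to you.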
The code below is a very simple implementation. For it I am subclassing LocationTextExtractionStrategy, which already implements ITextExtractionStrategy. On each call to RenderText() I find the rectangle of the current chunk (using Mark's code here) and store it for later. I use this simple helper class to store these chunks and rectangles:
//Helper class that stores our rectangle and text
public class RectAndText {
public iTextSharp.text.Rectangle Rect;
public String Text;
public RectAndText(iTextSharp.text.Rectangle rect, String text) {
this.Rect = rect;
this.Text = text;
}
}
And here is the subclass:
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
//Hold each coordinate
public List<RectAndText> myPoints = new List<RectAndText>();
//Automatically called for each chunk of text in the PDF
public override void RenderText(TextRenderInfo renderInfo) {
base.RenderText(renderInfo);
//Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new iTextSharp.text.Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]
);
//Add this to our main collection
this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
}
}
Finally, an implementation of the above:
//Our test file
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
//Create our test file, nothing special
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
using (var doc = new Document()) {
using (var writer = PdfWriter.GetInstance(doc, fs)) {
doc.Open();
doc.Add(new Paragraph("This is my sample file"));
doc.Close();
}
}
}
//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();
//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}
//Loop through each chunk found
foreach (var p in t.myPoints) {
Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}
I can't stress enough that the above does not take "words" into account; that will be up to you. The TextRenderInfo object that gets passed into RenderText has a method called GetCharacterRenderInfos() that you might be able to use to get more information. You might also want to use GetBaseline() instead of GetDescentLine() if you don't care about descenders in the font.
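EDIT
(I had a great lunch so I'm feeling a little more helpful.)
An updated version of MyLocationTextExtractionStrategy does what my comments below said, namely it takes a string to search for and searches each chunk for that string. For all the reasons listed this will not work in some/many/most/all cases. If the substring exists multiple times in a single chunk it will also only return the first instance. Ligatures and diacritics could also mess with this.
You would use it the same way as before, but the constructor now has a single required parameter: the string to search for.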
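As a rough, library-free sketch of the per-chunk search described above (this is not the answer's own code): assume that for each chunk you already have its text and one bounding box per character, for instance collected from GetCharacterRenderInfos() inside RenderText. The `Box` and `ChunkSearch` names are hypothetical, and the sketch assumes horizontal text with exactly one box per character of the chunk's text, which ligatures can violate.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a rectangle in PDF user space.
public struct Box
{
    public float Left, Bottom, Right, Top;
    public Box(float left, float bottom, float right, float top)
    { Left = left; Bottom = bottom; Right = right; Top = top; }
}

public static class ChunkSearch
{
    // Returns the box spanning the first occurrence of 'search' inside one chunk,
    // or null when the chunk does not contain it. charBoxes[i] must be the box
    // of chunkText[i]; a mismatch in length is treated as "not found".
    public static Box? Find(string chunkText, IList<Box> charBoxes, string search)
    {
        if (string.IsNullOrEmpty(search) || chunkText.Length != charBoxes.Count) return null;
        int start = chunkText.IndexOf(search, StringComparison.Ordinal);
        if (start < 0) return null;
        Box first = charBoxes[start];
        Box last = charBoxes[start + search.Length - 1];
        // Bottom-left of the first matched character, top-right of the last one.
        return new Box(first.Left, first.Bottom, last.Right, last.Top);
    }
}
```

Like the approach it illustrates, this only finds the first occurrence within a chunk and finds nothing when the search string is split across two chunks.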
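Answer 1 (score: 8)
This is an old question, but I am leaving my answer here because I could not find a correct answer on the web.
As Chris Haas has explained, dealing with words is not easy because iText works with chunks. The code Chris posted failed in most of my tests because a word is normally split across different chunks (he warns about that in his post).
To solve that problem, the strategy I used was to split each chunk into characters (one TextRenderInfo object per character), group the characters by line, and then search each line for the text.
I leave the code here. I tested it with several documents and it works pretty well, but it could fail in some scenarios because the chunk -> word transformation is a bit tricky.
Hope it helps someone.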
class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy
{
private List<LocationTextExtractionStrategyEx.ExtendedTextChunk> m_DocChunks = new List<ExtendedTextChunk>();
private List<LocationTextExtractionStrategyEx.LineInfo> m_LinesTextInfo = new List<LineInfo>();
public List<SearchResult> m_SearchResultsList = new List<SearchResult>();
private String m_SearchText;
public const float PDF_PX_TO_MM = 0.3528f;
public float m_PageSizeY;
public LocationTextExtractionStrategyEx(String sSearchText, float fPageSizeY)
: base()
{
this.m_SearchText = sSearchText;
this.m_PageSizeY = fPageSizeY;
}
private void searchText()
{
foreach (LineInfo aLineInfo in m_LinesTextInfo)
{
int iIndex = aLineInfo.m_Text.IndexOf(m_SearchText);
if (iIndex != -1)
{
TextRenderInfo aFirstLetter = aLineInfo.m_LineCharsList.ElementAt(iIndex);
SearchResult aSearchResult = new SearchResult(aFirstLetter, m_PageSizeY);
this.m_SearchResultsList.Add(aSearchResult);
}
}
}
private void groupChunksbyLine()
{
LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk1 = null;
LocationTextExtractionStrategyEx.LineInfo textInfo = null;
foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk2 in this.m_DocChunks)
{
if (textChunk1 == null)
{
textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
this.m_LinesTextInfo.Add(textInfo);
}
else if (textChunk2.sameLine(textChunk1))
{
textInfo.appendText(textChunk2);
}
else
{
textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
this.m_LinesTextInfo.Add(textInfo);
}
textChunk1 = textChunk2;
}
}
public override string GetResultantText()
{
groupChunksbyLine();
searchText();
//In this case the return value is not useful
return "";
}
public override void RenderText(TextRenderInfo renderInfo)
{
LineSegment baseline = renderInfo.GetBaseline();
//Create ExtendedChunk
ExtendedTextChunk aExtendedChunk = new ExtendedTextChunk(renderInfo.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetCharacterRenderInfos().ToList());
this.m_DocChunks.Add(aExtendedChunk);
}
public class ExtendedTextChunk
{
public string m_text;
private Vector m_startLocation;
private Vector m_endLocation;
private Vector m_orientationVector;
private int m_orientationMagnitude;
private int m_distPerpendicular;
private float m_charSpaceWidth;
public List<TextRenderInfo> m_ChunkChars;
public ExtendedTextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth,List<TextRenderInfo> chunkChars)
{
this.m_text = txt;
this.m_startLocation = startLoc;
this.m_endLocation = endLoc;
this.m_charSpaceWidth = charSpaceWidth;
this.m_orientationVector = this.m_endLocation.Subtract(this.m_startLocation).Normalize();
this.m_orientationMagnitude = (int)(Math.Atan2((double)this.m_orientationVector[1], (double)this.m_orientationVector[0]) * 1000.0);
this.m_distPerpendicular = (int)this.m_startLocation.Subtract(new Vector(0.0f, 0.0f, 1f)).Cross(this.m_orientationVector)[2];
this.m_ChunkChars = chunkChars;
}
public bool sameLine(LocationTextExtractionStrategyEx.ExtendedTextChunk textChunkToCompare)
{
return this.m_orientationMagnitude == textChunkToCompare.m_orientationMagnitude && this.m_distPerpendicular == textChunkToCompare.m_distPerpendicular;
}
}
public class SearchResult
{
public int iPosX;
public int iPosY;
public SearchResult(TextRenderInfo aCharcter, float fPageSizeY)
{
//Get position of upperLeft coordinate
Vector vTopLeft = aCharcter.GetAscentLine().GetStartPoint();
//PosX
float fPosX = vTopLeft[Vector.I1];
//PosY
float fPosY = vTopLeft[Vector.I2];
//Transform to mm and get y from top of page
iPosX = Convert.ToInt32(fPosX * PDF_PX_TO_MM);
iPosY = Convert.ToInt32((fPageSizeY - fPosY) * PDF_PX_TO_MM);
}
}
public class LineInfo
{
public string m_Text;
public List<TextRenderInfo> m_LineCharsList;
public LineInfo(LocationTextExtractionStrategyEx.ExtendedTextChunk initialTextChunk)
{
this.m_Text = initialTextChunk.m_text;
this.m_LineCharsList = initialTextChunk.m_ChunkChars;
}
public void appendText(LocationTextExtractionStrategyEx.ExtendedTextChunk additionalTextChunk)
{
m_LineCharsList.AddRange(additionalTextChunk.m_ChunkChars);
this.m_Text += additionalTextChunk.m_text;
}
}
}
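The SearchResult class above converts a position in PDF points into millimetres measured from the top-left of the page. The constant 0.3528 corresponds to 25.4 / 72, since a point is 1/72 inch in default user space. A minimal standalone sketch of that conversion, with hypothetical names (`PdfUnits`, `XToMm`, `YFromTopMm`) and assuming the page does not redefine the user unit, could look like this:

```csharp
using System;

public static class PdfUnits
{
    // One PDF point is 1/72 inch and one inch is 25.4 mm.
    public const float MmPerPoint = 25.4f / 72f;

    // Converts a horizontal position in points to millimetres.
    public static float XToMm(float xPoints)
    {
        return xPoints * MmPerPoint;
    }

    // Converts a y position measured upward from the bottom of the page (points)
    // into millimetres measured downward from the top of the page.
    public static float YFromTopMm(float yPoints, float pageHeightPoints)
    {
        return (pageHeightPoints - yPoints) * MmPerPoint;
    }
}
```

Keeping the result as a float and rounding only when displaying it avoids the whole-millimetre truncation that Convert.ToInt32 introduces in the class above.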
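Answer 2 (score: 5)
I know this is a really old question, but below is what I ended up doing. I am posting it here in the hope that it will be useful to someone else.
The following code gives you the starting coordinates of the line(s) that contain the search text. It should not be hard to modify it to give the positions of words. Note: I tested this on iTextSharp 5.5.11.0, and it will not work on some older versions.
As mentioned above, PDFs have no concept of words, lines, or paragraphs. But I found that LocationTextExtractionStrategy does a very good job of splitting lines and words, so my solution is based on that.
This solution is based on https://github.com/itext/itextsharp/blob/develop/src/core/iTextSharp/text/pdf/parser/LocationTextExtractionStrategy.cs, and that file has a comment saying that it is a development preview. So this might not work in the future.
Anyway, here is the code.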
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;
namespace Logic
{
public class LocationTextExtractionStrategyWithPosition : LocationTextExtractionStrategy
{
private readonly List<TextChunk> locationalResult = new List<TextChunk>();
private readonly ITextChunkLocationStrategy tclStrat;
public LocationTextExtractionStrategyWithPosition() : this(new TextChunkLocationStrategyDefaultImp()) {
}
/**
* Creates a new text extraction renderer, with a custom strategy for
* creating new TextChunkLocation objects based on the input of the
* TextRenderInfo.
* @param strat the custom strategy
*/
public LocationTextExtractionStrategyWithPosition(ITextChunkLocationStrategy strat)
{
tclStrat = strat;
}
private bool StartsWithSpace(string str)
{
if (str.Length == 0) return false;
return str[0] == ' ';
}
private bool EndsWithSpace(string str)
{
if (str.Length == 0) return false;
return str[str.Length - 1] == ' ';
}
/**
* Filters the provided list with the provided filter
* @param textChunks a list of all TextChunks that this strategy found during processing
* @param filter the filter to apply. If null, filtering will be skipped.
* @return the filtered list
* @since 5.3.3
*/
private List<TextChunk> filterTextChunks(List<TextChunk> textChunks, ITextChunkFilter filter)
{
if (filter == null)
{
return textChunks;
}
var filtered = new List<TextChunk>();
foreach (var textChunk in textChunks)
{
if (filter.Accept(textChunk))
{
filtered.Add(textChunk);
}
}
return filtered;
}
public override void RenderText(TextRenderInfo renderInfo)
{
LineSegment segment = renderInfo.GetBaseline();
if (renderInfo.GetRise() != 0)
{ // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to
Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
segment = segment.TransformBy(riseOffsetTransform);
}
TextChunk tc = new TextChunk(renderInfo.GetText(), tclStrat.CreateLocation(renderInfo, segment));
locationalResult.Add(tc);
}
public IList<TextLocation> GetLocations()
{
var filteredTextChunks = filterTextChunks(locationalResult, null);
filteredTextChunks.Sort();
TextChunk lastChunk = null;
var textLocations = new List<TextLocation>();
foreach (var chunk in filteredTextChunks)
{
if (lastChunk == null)
{
//initial
textLocations.Add(new TextLocation
{
Text = chunk.Text,
X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
});
}
else
{
if (chunk.SameLine(lastChunk))
{
var text = "";
// we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text))
text += ' ';
text += chunk.Text;
textLocations[textLocations.Count - 1].Text += text;
}
else
{
textLocations.Add(new TextLocation
{
Text = chunk.Text,
X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
});
}
}
lastChunk = chunk;
}
//now find the location(s) with the given texts
return textLocations;
}
}
public class TextLocation
{
public float X { get; set; }
public float Y { get; set; }
public string Text { get; set; }
}
}
How to call the method:
//Requires: using System.Linq; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
//Declared outside the using block so it is still in scope for the search below
IList<TextLocation> res;
using (var reader = new PdfReader(inputPdf))
{
var parser = new PdfReaderContentParser(reader);
var strategy = parser.ProcessContent(pageNumber, new LocationTextExtractionStrategyWithPosition());
res = strategy.GetLocations();
}
var searchResult = res.Where(p => p.Text.Contains(searchText)).OrderBy(p => p.Y).Reverse().ToList();
inputPdf is a byte[] that holds the PDF data.
pageNumber is the page you want to search in.
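The final filtering step can be tried out without a PDF at all. The sketch below uses a hypothetical `LineLocation` class as a stand-in for TextLocation and a hypothetical `LineSearch.Find` helper. It differs from the one-liner above in two labeled ways: it matches case-insensitively, and it uses OrderByDescending instead of OrderBy followed by Reverse. Both are assumptions about what you might want, not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the TextLocation class in the answer.
public class LineLocation
{
    public float X { get; set; }
    public float Y { get; set; }
    public string Text { get; set; }
}

public static class LineSearch
{
    // Returns the lines containing searchText, highest Y first
    // (the top of the page first in default PDF coordinates).
    public static List<LineLocation> Find(IEnumerable<LineLocation> lines, string searchText)
    {
        return lines
            .Where(l => l.Text != null
                && l.Text.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
            .OrderByDescending(l => l.Y)
            .ToList();
    }
}
```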
Answer 3 (score: 0)
Here is how to use LocationTextExtractionStrategy in VB.NET.
Class definition:
Class TextExtractor
Inherits LocationTextExtractionStrategy
Implements iTextSharp.text.pdf.parser.ITextExtractionStrategy
Public oPoints As IList(Of RectAndText) = New List(Of RectAndText)
Public Overrides Sub RenderText(renderInfo As TextRenderInfo) 'Implements IRenderListener.RenderText
MyBase.RenderText(renderInfo)
Dim bottomLeft As Vector = renderInfo.GetDescentLine().GetStartPoint()
Dim topRight As Vector = renderInfo.GetAscentLine().GetEndPoint() 'GetBaseline
Dim rect As Rectangle = New Rectangle(bottomLeft(Vector.I1), bottomLeft(Vector.I2), topRight(Vector.I1), topRight(Vector.I2))
oPoints.Add(New RectAndText(rect, renderInfo.GetText()))
End Sub
Private Function GetLines() As Dictionary(Of Single, ArrayList)
Dim oLines As New Dictionary(Of Single, ArrayList)
For Each p As RectAndText In oPoints
Dim iBottom = p.Rect.Bottom
If oLines.ContainsKey(iBottom) = False Then
oLines(iBottom) = New ArrayList()
End If
oLines(iBottom).Add(p)
Next
Return oLines
End Function
Public Function Find(ByVal sFind As String) As iTextSharp.text.Rectangle
Dim oLines As Dictionary(Of Single, ArrayList) = GetLines()
For Each oEntry As KeyValuePair(Of Single, ArrayList) In oLines
'Dim iBottom As Integer = oEntry.Key
Dim oRectAndTexts As ArrayList = oEntry.Value
Dim sLine As String = ""
For Each p As RectAndText In oRectAndTexts
sLine += p.Text
If sLine.IndexOf(sFind) <> -1 Then
Return p.Rect
End If
Next
Next
Return Nothing
End Function
End Class
Public Class RectAndText
Public Rect As iTextSharp.text.Rectangle
Public Text As String
Public Sub New(ByVal rect As iTextSharp.text.Rectangle, ByVal text As String)
Me.Rect = rect
Me.Text = text
End Sub
End Class
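GetLines above keys each line by the exact Bottom value of its chunks, so two chunks whose baselines differ by a tiny fraction of a point end up on separate lines. One possible variation, sketched here in C# to match the other examples in this thread, groups chunks whose Bottom values are within a tolerance and joins each line's text from left to right. The `PlacedText` and `LineGrouping` names and the tolerance parameter are hypothetical, and a suitable tolerance depends on the document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for RectAndText, reduced to what the grouping needs.
public class PlacedText
{
    public string Text;
    public float Left;
    public float Bottom;
    public PlacedText(string text, float left, float bottom)
    { Text = text; Left = left; Bottom = bottom; }
}

public static class LineGrouping
{
    // Walks the chunks from the top of the page down. A chunk joins the current line
    // when its Bottom is within 'tolerance' of that line's first chunk; otherwise it
    // starts a new line. Each line's text is then joined from left to right.
    public static List<string> GroupIntoLines(IEnumerable<PlacedText> chunks, float tolerance)
    {
        var lines = new List<List<PlacedText>>();
        foreach (var chunk in chunks.OrderByDescending(c => c.Bottom))
        {
            var current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (current != null && Math.Abs(current[0].Bottom - chunk.Bottom) <= tolerance)
                current.Add(chunk);
            else
                lines.Add(new List<PlacedText> { chunk });
        }
        return lines
            .Select(l => string.Concat(l.OrderBy(c => c.Left).Select(c => c.Text)))
            .ToList();
    }
}
```

Sorting by Left also removes the dependency on the order in which the chunks happened to be drawn, which the concatenation in Find otherwise relies on.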
Usage (inserts a signature box to the right of the found text):
Sub EncryptPdf(ByVal sInFilePath As String, ByVal sOutFilePath As String)
Dim oPdfReader As iTextSharp.text.pdf.PdfReader = New iTextSharp.text.pdf.PdfReader(sInFilePath)
Dim oPdfDoc As New iTextSharp.text.Document()
Dim oPdfWriter As PdfWriter = PdfWriter.GetInstance(oPdfDoc, New FileStream(sOutFilePath, FileMode.Create))
'oPdfWriter.SetEncryption(PdfWriter.STRENGTH40BITS, sPassword, sPassword, PdfWriter.AllowCopy)
oPdfDoc.Open()
oPdfDoc.SetPageSize(iTextSharp.text.PageSize.LEDGER.Rotate())
Dim oDirectContent As iTextSharp.text.pdf.PdfContentByte = oPdfWriter.DirectContent
Dim iNumberOfPages As Integer = oPdfReader.NumberOfPages
Dim iPage As Integer = 0
Dim iBottomMargin As Integer = txtBottomMargin.Text '10
Dim iLeftMargin As Integer = txtLeftMargin.Text '500
Dim iWidth As Integer = txtWidth.Text '120
Dim iHeight As Integer = txtHeight.Text '780
Dim oStrategy As New parser.SimpleTextExtractionStrategy()
Do While (iPage < iNumberOfPages)
iPage += 1
oPdfDoc.SetPageSize(oPdfReader.GetPageSizeWithRotation(iPage))
oPdfDoc.NewPage()
Dim oPdfImportedPage As iTextSharp.text.pdf.PdfImportedPage =
oPdfWriter.GetImportedPage(oPdfReader, iPage)
Dim iRotation As Integer = oPdfReader.GetPageRotation(iPage)
If (iRotation = 90) Or (iRotation = 270) Then
oDirectContent.AddTemplate(oPdfImportedPage, 0, -1.0F, 1.0F,
0, 0, oPdfReader.GetPageSizeWithRotation(iPage).Height)
Else
oDirectContent.AddTemplate(oPdfImportedPage, 1.0F, 0, 0, 1.0F, 0, 0)
End If
'Dim sPageText As String = parser.PdfTextExtractor.GetTextFromPage(oPdfReader, iPage, oStrategy)
'sPageText = System.Text.Encoding.UTF8.GetString(System.Text.ASCIIEncoding.Convert(System.Text.Encoding.Default, System.Text.Encoding.UTF8, System.Text.Encoding.Default.GetBytes(sPageText)))
'If txtFind.Text = "" OrElse sPageText.IndexOf(txtFind.Text) <> -1 Then
Dim oTextExtractor As New TextExtractor()
PdfTextExtractor.GetTextFromPage(oPdfReader, iPage, oTextExtractor) 'Initialize oTextExtractor
Dim oRect As iTextSharp.text.Rectangle = oTextExtractor.Find(txtFind.Text)
If oRect IsNot Nothing Then
Dim iX As Integer = oRect.Left + oRect.Width + iLeftMargin 'Move right
Dim iY As Integer = oRect.Bottom - iBottomMargin 'Move down
Dim field As PdfFormField = PdfFormField.CreateSignature(oPdfWriter)
field.SetWidget(New Rectangle(iX, iY, iX + iWidth, iY + iHeight), PdfAnnotation.HIGHLIGHT_OUTLINE)
field.FieldName = "myEmptySignatureField" & iPage
oPdfWriter.AddAnnotation(field)
End If
Loop
oPdfDoc.Close()
End Sub