Question

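I am trying to convert a PDF (.pdf) to text (.txt) using the iTextSharp library, but spaces are being inserted inside words in the text file.

Content in the PDF:

"Location Path: D:\PDF Files\Projects\101-14-A\2015_10_12\Test Methods\121015.pdf"

Content in the text file after converting the PDF to txt: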

"L ocation Path: D:\PDF Files\Projects\101-14-A\2015_10_12\Test Methods\121015.pdf"

Sometimes the content in the text file is instead:

"Location Path: D:\PDF Files\Projects\101-14-A\2015_10_12\Test Methods\121015.pdf"

I am using the following code to convert the PDF to a text file:

Imports iTextSharp

Private Sub Form_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
    Dim sOut As String = String.Empty
    Dim oReader As New iTextSharp.text.pdf.PdfReader(Filepath)
    Dim strategy1 As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
    Dim strategy2 As New iTextSharp.text.pdf.parser.LocationTextExtractionStrategy

    sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, 1, strategy1)
    sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, 1, strategy2)
End Sub

With both strategies the result is unexpected, so could anyone please help with this?

Answer 1

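At first glance

I inspected the files provided by the OP using the current iTextSharp version, 5.5.7. First of all, they do not show the exact issue the OP described, because none of the documents even contains a "Location Path" line. They do, however, show similar behavior for footers like this: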

Method Path: D:\Analyst Data\Projects\879...

Here, extraction with the SimpleTextExtractionStrategy does indeed return:

M ethod Path: D:\Analyst Data\Projects\879...
Results Path: D:\Analyst Data\Projects\879...

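I have already explained some of the background in a comment:

In "Method Path", the two letters 'M' and 'e' do not properly follow each other; they actually nearly overlap! In such a case iText wants to indicate that the letters do not properly follow each other, and since a string cannot contain a small backward step, iText signals this with a space character.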
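The behavior described in that comment can be modeled in a few lines. This is a simplified one-dimensional sketch, not iTextSharp's actual code: `ChunkModel`, `GapHeuristic`, and the half-space-width threshold are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical 1-D model of a text chunk; not an iTextSharp type.
public sealed class ChunkModel
{
    public readonly string Text;
    public readonly float Start; // x coordinate where the chunk begins
    public readonly float End;   // x coordinate where the chunk ends

    public ChunkModel(string text, float start, float end)
    {
        Text = text;
        Start = start;
        End = end;
    }
}

public static class GapHeuristic
{
    // Joins chunks in arrival order, inserting a space whenever the next chunk
    // does not start (roughly) where the previous one ended.
    public static string Join(IList<ChunkModel> chunks, float spaceWidth)
    {
        var sb = new StringBuilder();
        ChunkModel prev = null;
        foreach (ChunkModel c in chunks)
        {
            if (prev != null)
            {
                // Absolute distance: a backward step (overlap) counts as a gap too.
                float gap = Math.Abs(c.Start - prev.End);
                if (gap > spaceWidth / 2f) sb.Append(' ');
            }
            sb.Append(c.Text);
            prev = c;
        }
        return sb.ToString();
    }
}
```

With a space width of 5, an "M" ending at x = 10 followed by "ethod" starting at x = 4 is joined as "M ethod", whereas "ethod" starting exactly at x = 10 is joined as "Method".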

The situation here differs from the one the OP describes, however, because extraction using the LocationTextExtractionStrategy returns:

Method Path: D:\Analyst Data\Projects\879...
Results Path: D:\Analyst Data\Projects\879...

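Thus, I would advise the OP to check which iTextSharp version he uses and to switch to the LocationTextExtractionStrategy of the current library version.

If the issue persists even after updating to the current version, please share a PDF in which the issue can be observed using the LocationTextExtractionStrategy.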

At second glance

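As the OP insisted that there are spaces that should not be there even in the LocationTextExtractionStrategy output, I did find some after searching again and again (a hint about the whereabouts of those spaces would have been nice...).

I found "A nalyst", "O perator", and "W orkstation" in the header section.

Unfortunately, their cause has nothing to do with iText indicating that text chunks do not properly follow each other. Instead, the cause is an actual space character. For example, in the case of "A nalyst" the content stream looks like this: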

q
1.0 1.0 1.0 rg
29.28 822.24 39.12 4.8 re
f
BT
/F2 4.515 Tf
0.0 0.0 0.0 rg
0.9998 0.0 0.0 1.0 29.28 823.2 Tm
[( )] TJ
ET
Q
q
BT
/F2 4.515 Tf
0.0 0.0 0.0 rg
0.9998 0.0 0.0 1.0 29.28 823.2 Tm
[(A)30(n)21(a)18(l)65(y)21(s)-36(t)12( )-15(V)137(e)71(r)14(s)-36(i)65(o)-31(n)21(:)65( )-15(1)21(.)-15(6)21(.)-15(2)] TJ
ET
Q

I.e. after drawing a white rectangle (re, f), a space character is drawn ([( )] TJ) at exactly the same position (0.9998 0.0 0.0 1.0 29.28 823.2 Tm) as the 'A' of "Analyst" ([(A)...] TJ).
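When reading such a content stream, it helps to remember that in a TJ array only the parenthesized strings are text; the numbers between them are positioning adjustments, not characters. The following is a minimal sketch for reading the quoted arrays by eye. `TjArray` is a hypothetical helper, and it assumes the strings contain no escaped or nested parentheses.

```csharp
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for inspecting TJ arrays; not part of iTextSharp.
public static class TjArray
{
    // Concatenates the (...) string operands of a TJ array and skips the
    // numeric adjustments between them.
    // Assumption: no escaped or nested parentheses inside the strings.
    public static string Text(string tjOperand)
    {
        var sb = new StringBuilder();
        foreach (Match m in Regex.Matches(tjOperand, @"\(([^()]*)\)"))
        {
            sb.Append(m.Groups[1].Value);
        }
        return sb.ToString();
    }
}
```

Applied to the two arrays above, the first yields a single space and the second yields "Analyst Version: 1.6.2", so the stray space comes from the first text object alone.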

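After sorting the text chunks, the space happens to come after the 'A' (the order of characters at the same position after sorting is arbitrary), hence you get "A nalyst".

One might consider changing the text extraction strategy to keep characters printed at the same position in their original order, but that might in turn break other documents. Still...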

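A more faithful extraction strategy

We will try to tweak the LocationTextExtractionStrategy so that it preserves the order of text chunks positioned at the same location.

The LocationTextExtractionStrategy is explicitly arbitrary in this respect: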

virtual public int CompareTo(TextChunk rhs) {
    if (this == rhs) return 0; // not really needed, but just in case

    int rslt;
    rslt = CompareInts(orientationMagnitude, rhs.orientationMagnitude);
    if (rslt != 0) return rslt;

    rslt = CompareInts(distPerpendicular, rhs.distPerpendicular);
    if (rslt != 0) return rslt;

    // note: it's never safe to check floating point numbers for equality, and if two chunks
    // are truly right on top of each other, which one comes first or second just doesn't matter
    // so we arbitrarily choose this way.
    rslt = distParallelStart < rhs.distParallelStart ? -1 : 1;

    return rslt;
}

(LocationTextExtractionStrategy.TextChunk)

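This arbitrary choice is not only unnecessary, it also violates the contract of the IComparable interface that this method implements:

For objects A, B, and C, the following must be true:

A.CompareTo(A) must return zero.

If A.CompareTo(B) returns zero, then B.CompareTo(A) must return zero.

If A.CompareTo(B) returns zero and B.CompareTo(C) returns zero, then A.CompareTo(C) must return zero.

If A.CompareTo(B) returns a value other than zero, then B.CompareTo(A) must return a value of the opposite sign.

If A.CompareTo(B) returns a value x not equal to zero, and B.CompareTo(C) returns a value y of the same sign as x, then A.CompareTo(C) must return a value of the same sign as x and y.

(MSDN on IComparable.CompareTo)

In particular, the fourth condition is not fulfilled: for two distinct chunks with identical orientationMagnitude, distPerpendicular, and distParallelStart values, both A.CompareTo(B) and B.CompareTo(A) return 1.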
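The contract violation can be reproduced without iTextSharp. The sketch below copies only the comparison logic quoted above into a stand-in class; `ArbitraryChunk` is a hypothetical name, and the three fields are plain values rather than quantities derived from a real page.

```csharp
using System;

// Hypothetical stand-in for TextChunk that mirrors the quoted CompareTo logic.
public sealed class ArbitraryChunk : IComparable<ArbitraryChunk>
{
    private readonly int orientationMagnitude;
    private readonly int distPerpendicular;
    private readonly float distParallelStart;

    public ArbitraryChunk(int orientationMagnitude, int distPerpendicular, float distParallelStart)
    {
        this.orientationMagnitude = orientationMagnitude;
        this.distPerpendicular = distPerpendicular;
        this.distParallelStart = distParallelStart;
    }

    public int CompareTo(ArbitraryChunk rhs)
    {
        if (ReferenceEquals(this, rhs)) return 0;

        int rslt = orientationMagnitude.CompareTo(rhs.orientationMagnitude);
        if (rslt != 0) return rslt;

        rslt = distPerpendicular.CompareTo(rhs.distPerpendicular);
        if (rslt != 0) return rslt;

        // Same tie-break as the quoted code: never 0, and 1 whenever the
        // two start positions are equal.
        return distParallelStart < rhs.distParallelStart ? -1 : 1;
    }
}
```

For two distinct instances created as `new ArbitraryChunk(0, 0, 29.28f)`, both `a.CompareTo(b)` and `b.CompareTo(a)` return 1, so a sort algorithm is free to place either one first.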

The following text extraction strategy is not arbitrary in this respect; as the final comparison criterion, it keeps text chunks in the order in which they arrived:

using System;
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

class FaithfulLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    public class FaithfulTextChunk : TextChunk
    {
        public FaithfulTextChunk(String stringValue, Vector startLocation, Vector endLocation, float charSpaceWidth, int index)
            : base(stringValue, startLocation, endLocation, charSpaceWidth)
        {
            this.index = index;
            // The sort keys are private in TextChunk, so they are read via reflection.
            orientationMagnitudeField = typeof(TextChunk).GetField("orientationMagnitude", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
            distPerpendicularField = typeof(TextChunk).GetField("distPerpendicular", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
            distParallelStartField = typeof(TextChunk).GetField("distParallelStart", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
        }

        override public int CompareTo(TextChunk rhs)
        {
            if (this == rhs) return 0; // not really needed, but just in case
            if (rhs is FaithfulTextChunk)
            {
                int rslt;
                rslt = CompareInts((int)orientationMagnitudeField.GetValue(this), (int)orientationMagnitudeField.GetValue(rhs));
                if (rslt != 0) return rslt;

                rslt = CompareInts((int)distPerpendicularField.GetValue(this), (int)distPerpendicularField.GetValue(rhs));
                if (rslt != 0) return rslt;

                rslt = CompareFloats((float)distParallelStartField.GetValue(this), (float)distParallelStartField.GetValue(rhs));
                if (rslt != 0) return rslt;

                // Chunks at the very same position keep their arrival order.
                return CompareInts(index, ((FaithfulTextChunk)rhs).index);
            }
            else
                return base.CompareTo(rhs);
        }

        private static int CompareInts(int int1, int int2)
        {
            return int1 == int2 ? 0 : int1 < int2 ? -1 : 1;
        }

        private static int CompareFloats(float float1, float float2)
        {
            return float1 == float2 ? 0 : float1 < float2 ? -1 : 1;
        }

        int index;
        System.Reflection.FieldInfo orientationMagnitudeField, distPerpendicularField, distParallelStartField;
    }

    public override void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment segment = renderInfo.GetBaseline();
        if (renderInfo.GetRise() != 0)
        {
            // remove the rise from the baseline - we do this because the text from a super/subscript render operations
            // should probably be considered as part of the baseline of the text the super/sub is relative to
            Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
            segment = segment.TransformBy(riseOffsetTransform);
        }
        // nextIndex records the order in which chunks arrive from the content stream.
        TextChunk location = new FaithfulTextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), nextIndex++);
        getLocationalResult().Add(location);
    }

    public FaithfulLocationTextExtractionStrategy()
    {
        locationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    }

    List<TextChunk> getLocationalResult()
    {
        return (List<TextChunk>) locationalResultField.GetValue(this);
    }

    System.Reflection.FieldInfo locationalResultField;
    int nextIndex = 0;
}

Using this strategy I extract "Analyst" instead of "A nalyst", "Operator" instead of "O perator", and "Workstation" instead of "W orkstation".

As many of the variables used are private, the class uses reflection. This might not be allowed in some contexts. In such a case, simply work these changes into a copy of the LocationTextExtractionStrategy code.
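The effect of the arrival index as the last comparison key can be seen in isolation with a small model. This is a sketch under simplifying assumptions, not the strategy itself: `IndexedChunk` and `FaithfulOrder` are hypothetical names, and position is reduced to a single start coordinate.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical 1-D model of a chunk that remembers its arrival order.
public sealed class IndexedChunk : IComparable<IndexedChunk>
{
    public readonly string Text;
    public readonly float Start; // position along the line
    public readonly int Index;   // arrival order in the content stream

    public IndexedChunk(string text, float start, int index)
    {
        Text = text;
        Start = start;
        Index = index;
    }

    public int CompareTo(IndexedChunk rhs)
    {
        int rslt = Start.CompareTo(rhs.Start);
        if (rslt != 0) return rslt;
        // Equal positions: fall back to arrival order instead of an arbitrary choice.
        return Index.CompareTo(rhs.Index);
    }
}

public static class FaithfulOrder
{
    // Sorts the list in place and concatenates the chunk texts.
    public static string SortAndJoin(List<IndexedChunk> chunks)
    {
        // List<T>.Sort is not a stable sort, but ties are resolved by Index,
        // so the result no longer depends on the sort algorithm.
        chunks.Sort();
        var sb = new StringBuilder();
        foreach (IndexedChunk c in chunks) sb.Append(c.Text);
        return sb.ToString();
    }
}
```

If a space (index 0) and an "A" (index 1) both start at 29.28 and "nalyst" (index 2) starts at 32.0, the joined result is " Analyst" regardless of the order in which the chunks were added to the list: the space stays in front of the word instead of splitting it.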

iTextSharp PDF text header and footer reading issue
