Question

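I have a PDF file.

I am using the iTextSharp classes to read text from the PDF file programmatically. It reads the ANSI-encoded text, but it does not read the IDENTITY-H encoded text.

My question is: how can I read IDENTITY-H text from a PDF file using VB.NET?

Here is my code: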

Imports System.IO
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Public Function ReadPDFFile(ByVal strSource As String) As String

    Dim sbPDFText As New StringBuilder() 'StringBuilder Object To Store read Text

    If File.Exists(strSource) Then 'Does File Exist?
        Dim pdfFileReader As New PdfReader(strSource) 'read File
        For intCurrPage As Integer = 1 To pdfFileReader.NumberOfPages 'Loop Through All Pages

            Dim lteStrategy As LocTextExtractionStrategy = New LocTextExtractionStrategy 'Read PDF File Content Blocks
            'Get Text
            Dim strCurrText As String = PdfTextExtractor.GetTextFromPage(pdfFileReader, intCurrPage, lteStrategy)

            sbPDFText.Append(strCurrText) 'Add Text To String Builder
        Next
        pdfFileReader.Close() 'Close File
    End If
    Return sbPDFText.ToString() 'Return

End Function

Public Overridable Sub RenderText(ByVal renderInfo As TextRenderInfo) Implements ITextExtractionStrategy.RenderText

Dim segment As LineSegment = renderInfo.GetBaseline()
Dim location As New TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth())

If renderInfo.GetText = "" Then
    Console.WriteLine(GetResultantText())
End If
With location
    'Chunk Location:
    Debug.Print(renderInfo.GetText)
    .PosLeft = renderInfo.GetDescentLine.GetStartPoint(Vector.I1)
    .PosRight = renderInfo.GetAscentLine.GetEndPoint(Vector.I1)
    .PosBottom = renderInfo.GetDescentLine.GetStartPoint(Vector.I2)
    .PosTop = renderInfo.GetAscentLine.GetEndPoint(Vector.I2)
    'Chunk Font Size: (Height)
    .curFontSize = .PosTop - segment.GetStartPoint()(Vector.I2)
    'Use Font name  and Size as Key in the SortedList
    Dim StrKey As String = renderInfo.GetFont.PostscriptFontName & .curFontSize.ToString
    'Add this font to ThisPdfDocFonts SortedList if it's not already present
    If 1 = 1 Then
        If Not ThisPdfDocFonts.ContainsKey(StrKey) Then ThisPdfDocFonts.Add(StrKey, renderInfo.GetFont)
        'Store the SortedList index in this Chunk, so we can get it later
        .FontIndex = ThisPdfDocFonts.IndexOfKey(StrKey)
        Console.WriteLine(renderInfo.GetFont.ToString & "-->" & StrKey)
    Else
        'pcbContent.SetFontAndSize(BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED), 9)
        .FontIndex = 3
        .curFontSize = 8
    End If
End With
locationalResult.Add(location)

End Sub

Answer 1

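Thank you for sharing the PDF document. It helped us establish that the problem you describe is not an iTextSharp problem. Instead, it is a problem with the PDF document itself.

There is no solution to this problem, but I am posting this answer to explain how you can see that the problem exists without involving iTextSharp at all.

Open the document in Adobe Reader. Select the text "Muy señores nuestros" and copy/paste it into a text editor. You will get "Muy señores nuestros". This is text that can be extracted with iTextSharp (and that extraction works correctly).

Now do the same with the text "GUARDIAN GLASS EXPRESS, S.L.". You get the following result: "". As you can see, you cannot copy/paste the text correctly from Adobe Reader. This is caused by the way the text is stored in the PDF. If you cannot copy/paste text from Adobe Reader, you should not expect to be able to extract it with iTextSharp. The PDF was created in a way that does not allow extraction.

Please watch this video for possible causes: https://www.youtube.com/watch?v=wxGEEv7ibHE

I am sorry it took so long to figure this out, only to find that what you are asking for is impossible. Your question narrowed the problem down too much, as if it were caused by the "IDENTITY-H" encoding and iTextSharp. In reality, you are trying to extract text that cannot be extracted.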

If you look at the page dictionary in the PDF, you will find three font resources for the first (and only) page:

[Image: the page dictionary with its font resources, and the content stream]

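In the content stream (below, at the small red arrows), you will see two strings (in hexadecimal notation) that are shown using the fonts referred to by the names C2_0 and C2_1. Incidentally, these fonts are stored as composite fonts with /Subtype /Type0 and /Encoding Identity-H. This means that the characters used in the hexadecimal strings should correspond to the Unicode values of the glyphs. If that is not the case, you are out of luck.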
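To make that check concrete, here is a minimal VB.NET sketch that reads a PDF hexadecimal string as 2-byte big-endian codes and interprets each code as a Unicode value. The module name, the function names and the sample hex strings are made up for illustration; they are not taken from your PDF. The sketch only tests the assumption stated above and does not look at any other mapping information the font may carry.

```vb
Imports System
Imports System.Text

Module IdentityHSketch ' hypothetical name, for illustration only

    ' Converts a PDF hex string such as "<00470055>" into its raw bytes.
    ' Anything that is not a hex digit (angle brackets, whitespace) is skipped.
    Function HexToBytes(ByVal hexString As String) As Byte()
        Dim digits As New StringBuilder()
        For Each c As Char In hexString
            If Uri.IsHexDigit(c) Then digits.Append(c)
        Next
        ' An odd number of digits is completed with a trailing 0.
        If digits.Length Mod 2 = 1 Then digits.Append("0"c)

        Dim bytes(digits.Length \ 2 - 1) As Byte
        For i As Integer = 0 To bytes.Length - 1
            bytes(i) = Convert.ToByte(digits.ToString(2 * i, 2), 16)
        Next
        Return bytes
    End Function

    ' Treats every pair of bytes as one big-endian 16-bit Unicode value.
    Function DecodeAsUnicode(ByVal hexString As String) As String
        Return Encoding.BigEndianUnicode.GetString(HexToBytes(hexString))
    End Function

    Sub Main()
        ' Made-up sample: these codes happen to be the Unicode values of the letters.
        Console.WriteLine(DecodeAsUnicode("<0047005500410052004400490041004E>"))
        ' Made-up sample: these codes are small glyph numbers, so the result is unreadable.
        Console.WriteLine(DecodeAsUnicode("<0003000400050006>"))
    End Sub

End Module
```

If the strings from your content stream decode to readable text this way, the codes match Unicode values. If they decode to control characters or unrelated symbols, as in the second sample, they behave like the text that could not be copied from Adobe Reader.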

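The font with the name /TT0 does not seem to have any problem.

The fact that /TT0 uses WinAnsiEncoding while the other fonts use Identity-H is irrelevant. Plenty of PDF files contain fonts that use Identity-H and still allow text to be copied/pasted or extracted with iTextSharp. Unfortunately, there is probably something wrong with the way this PDF was built. Analyzing what went wrong would take too much time, so the best approach is to contact the person who gave you the PDF and ask him/her to fix it.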
