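I wrote the following function to read the text from a PDF file. It is pretty close, but I am not familiar enough with all of the op codes to get the line spacing right. For example, I insert a new line whenever I see "ET", but that does not seem quite right, since it may just be the end of a text run in the middle of a line. Can someone help me adjust the parsing? My goal is something similar to Adobe Reader's "Save as" -> "Text".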
    Imports System.Text
    Imports PdfSharp.Pdf
    Imports PdfSharp.Pdf.Content
    Imports PdfSharp.Pdf.Content.Objects
    Imports PdfSharp.Pdf.IO

    Public Function ReadPDFFile(filePath As String,
                                Optional maxLength As Integer = 0) As String
        Dim sbContents As New StringBuilder
        Dim cArrayType As Type = GetType(CArray)
        Dim cCommentType As Type = GetType(CComment)
        Dim cIntegerType As Type = GetType(CInteger)
        Dim cNameType As Type = GetType(CName)
        Dim cNumberType As Type = GetType(CNumber)
        Dim cOperatorType As Type = GetType(COperator)
        Dim cRealType As Type = GetType(CReal)
        Dim cSequenceType As Type = GetType(CSequence)
        Dim cStringType As Type = GetType(CString)
        Dim opCodeNameType As Type = GetType(OpCodeName)
        ' Declared before assignment so the lambda can call itself recursively.
        Dim ReadObject As Action(Of CObject) = Nothing
        ReadObject = Sub(obj As CObject)
                         Dim objType As Type = obj.GetType
                         Select Case objType
                             Case cArrayType
                                 Dim arrObj As CArray = DirectCast(obj, CArray)
                                 For Each member As CObject In arrObj
                                     ReadObject(member)
                                 Next
                             Case cOperatorType
                                 Dim opObj As COperator = DirectCast(obj, COperator)
                                 Select Case System.Enum.GetName(opCodeNameType, opObj.OpCode.OpCodeName)
                                     Case "ET", "Tx"
                                         ' End of text object, or T* (move to next line)
                                         sbContents.Append(vbNewLine)
                                     Case "Tj", "TJ"
                                         ' Show text
                                         For Each operand As CObject In opObj.Operands
                                             ReadObject(operand)
                                         Next
                                     Case "QuoteSingle", "QuoteDbl"
                                         ' ' and ": move to next line, then show text
                                         sbContents.Append(vbNewLine)
                                         For Each operand As CObject In opObj.Operands
                                             ReadObject(operand)
                                         Next
                                     Case Else
                                         'Do Nothing
                                 End Select
                             Case cSequenceType
                                 Dim seqObj As CSequence = DirectCast(obj, CSequence)
                                 For Each member As CObject In seqObj
                                     ReadObject(member)
                                 Next
                             Case cStringType
                                 sbContents.Append(DirectCast(obj, CString).Value)
                             Case cCommentType, cIntegerType, cNameType, cNumberType, cRealType
                                 'Do Nothing
                             Case Else
                                 Throw New NotImplementedException(obj.GetType().AssemblyQualifiedName)
                         End Select
                     End Sub
        Using pd As PdfDocument = PdfReader.Open(filePath, PdfDocumentOpenMode.ReadOnly)
            For Each page As PdfPage In pd.Pages
                ReadObject(ContentReader.ReadContent(page))
                If maxLength > 0 AndAlso sbContents.Length >= maxLength Then
                    If sbContents.Length > maxLength Then
                        ' Drop everything past maxLength (Remove takes a start index and a count).
                        sbContents.Remove(maxLength, sbContents.Length - maxLength)
                    End If
                    Exit For
                End If
                sbContents.Append(vbNewLine)
            Next
        End Using
        Return sbContents.ToString
    End Function
Answer 0 (score: 3)
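Your code ignores almost all of the operations that change the line. You do take ' and " into account, which imply a move to the next line, but these are fairly seldom used in the wild.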
Inside a text object (BT .. ET), you should also watch for the text-positioning operators Td, TD, Tm, and T*.
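To interpret ', ", and T* properly, you should also watch for TL, which sets the text leading that these operators rely on.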
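As a rough illustration of how the positioning operators and the leading interact, here is a minimal sketch. It does not use PDFsharp; `TextOp` and `VerticalMoves` are hypothetical names introduced only for this example, and operators are identified by their raw PDF names (the question's code sees T*, ' and " as `Tx`, `QuoteSingle` and `QuoteDbl`). Tm is left out because it needs full matrix handling.

```vb
Imports System
Imports System.Collections.Generic

Module LineBreakSketch
    ' Hypothetical token shape: a raw operator name plus its numeric operands.
    Public Structure TextOp
        Public Name As String
        Public Operands As Double()
        Public Sub New(opName As String, ParamArray opOperands As Double())
            Name = opName
            Operands = opOperands
        End Sub
    End Structure

    ' Returns, for each operator, the vertical move it causes in text space.
    ' A non-zero value means the operator starts a new line.
    Public Function VerticalMoves(ops As IEnumerable(Of TextOp)) As List(Of Double)
        Dim leading As Double = 0.0   ' text leading, set by TL or as a side effect of TD
        Dim moves As New List(Of Double)
        For Each op As TextOp In ops
            Select Case op.Name
                Case "TL"
                    leading = op.Operands(0)
                    moves.Add(0.0)
                Case "Td"
                    ' tx ty Td: only the second operand moves vertically.
                    moves.Add(op.Operands(1))
                Case "TD"
                    ' tx ty TD behaves like Td and also sets the leading to -ty.
                    leading = -op.Operands(1)
                    moves.Add(op.Operands(1))
                Case "T*", "'", """"
                    ' These move down by the current leading.
                    moves.Add(-leading)
                Case Else
                    moves.Add(0.0)
            End Select
        Next
        Return moves
    End Function

    Sub Main()
        ' TL 14, a horizontal-only Td, T*, then TD, then T* again.
        ' The resulting moves are 0, 0, -14, -20, -20.
        Dim ops As TextOp() = {
            New TextOp("TL", 14), New TextOp("Td", 20, 0), New TextOp("T*"),
            New TextOp("TD", 0, -20), New TextOp("T*")}
        For Each dy As Double In VerticalMoves(ops)
            Console.WriteLine(If(dy = 0, "same line", "new line, dy = " & dy.ToString()))
        Next
    End Sub
End Module
```

In a real extractor you would append a line break only when the vertical move is non-zero, rather than on every ET.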
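If you find multiple text objects (BT .. ET .. BT .. ET), the second one is not necessarily on a new line. You should watch for the special graphics state operators between them: cm, q, and Q.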
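To make that concrete, here is a minimal sketch of tracking the current transformation matrix across q, Q, and cm, so that two text objects can be compared by where they actually land on the page. It is independent of PDFsharp; `GraphicsStateTracker`, `Apply`, and `Multiply` are hypothetical names, and matrices are stored as six numbers [a b c d e f] as in the cm operands.

```vb
Imports System
Imports System.Collections.Generic

Module GraphicsStateSketch
    ' Returns m1 x m2 for PDF matrices stored as [a b c d e f].
    Public Function Multiply(m1 As Double(), m2 As Double()) As Double()
        Return New Double() {
            m1(0) * m2(0) + m1(1) * m2(2),
            m1(0) * m2(1) + m1(1) * m2(3),
            m1(2) * m2(0) + m1(3) * m2(2),
            m1(2) * m2(1) + m1(3) * m2(3),
            m1(4) * m2(0) + m1(5) * m2(2) + m2(4),
            m1(4) * m2(1) + m1(5) * m2(3) + m2(5)}
    End Function

    Public Class GraphicsStateTracker
        Private ReadOnly saved As New Stack(Of Double())

        ' Current transformation matrix, starting as the identity.
        Public Property Ctm As Double() = {1, 0, 0, 1, 0, 0}

        ' Operator names are compared case-sensitively: q saves, Q restores.
        Public Sub Apply(opName As String, ParamArray operands As Double())
            Select Case opName
                Case "q"
                    saved.Push(Ctm)
                Case "Q"
                    If saved.Count > 0 Then Ctm = saved.Pop()
                Case "cm"
                    ' cm premultiplies the current matrix by its operands.
                    Ctm = GraphicsStateSketch.Multiply(operands, Ctm)
            End Select
        End Sub
    End Class

    Sub Main()
        Dim gs As New GraphicsStateTracker()
        gs.Apply("q")
        gs.Apply("cm", 1, 0, 0, 1, 0, 700)
        Dim firstY As Double = gs.Ctm(5)     ' vertical origin of the first text object
        gs.Apply("Q")
        gs.Apply("q")
        gs.Apply("cm", 1, 0, 0, 1, 200, 700)
        Dim secondY As Double = gs.Ctm(5)    ' vertical origin of the second text object
        Console.WriteLine(If(firstY = secondY, "same baseline", "different baseline"))
    End Sub
End Module
```

Here both text objects end up with the same vertical origin, so inserting a line break at the ET between them would be wrong. A fuller implementation would also combine this matrix with the text matrix set by Tm and Td.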
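Your code ignores all numeric arguments of the operations. You should not ignore them; in particular, check the operands of the text-positioning operators: while 0 -20 Td starts a new line 20 units down, 20 0 Td stays on the same line and merely starts drawing text 20 units to the right of where the previous line started.
Your code assumes that the Value of CString instances already contains Unicode-encoded character data. This assumption is, in general, not correct: the encoding used in PDF strings drawn by text-drawing operations is governed by the font. Thus, you should also watch for Tf, which sets the current font and size.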
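The following sketch illustrates why the current font matters when turning string bytes into text. It is not PDFsharp code: `Decode` and the per-font maps are hypothetical, the maps would in practice have to be built from each font's encoding or ToUnicode information, and only single-byte character codes are assumed.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text
Imports Microsoft.VisualBasic

Module FontDecodingSketch
    ' fontMaps: hypothetical lookup from font resource name (the Tf operand)
    ' to a map from character code to Unicode text.
    Public Function Decode(fontMaps As Dictionary(Of String, Dictionary(Of Byte, String)),
                           fontName As String, rawBytes As Byte()) As String
        Dim map As Dictionary(Of Byte, String) = Nothing
        If Not fontMaps.TryGetValue(fontName, map) Then
            Return Nothing   ' unknown font: the bytes cannot be interpreted
        End If
        Dim sb As New StringBuilder
        For Each code As Byte In rawBytes
            Dim mapped As String = Nothing
            If map.TryGetValue(code, mapped) Then
                sb.Append(mapped)
            Else
                sb.Append(ChrW(&HFFFD))   ' replacement character for unmapped codes
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Two made-up fonts that assign different meanings to the same codes.
        Dim f1 As New Dictionary(Of Byte, String)
        f1.Add(CByte(1), "H")
        f1.Add(CByte(2), "i")
        Dim f2 As New Dictionary(Of Byte, String)
        f2.Add(CByte(1), "O")
        f2.Add(CByte(2), "k")
        Dim fonts As New Dictionary(Of String, Dictionary(Of Byte, String))
        fonts.Add("F1", f1)
        fonts.Add("F2", f2)

        Dim raw As Byte() = {CByte(1), CByte(2)}
        Console.WriteLine(Decode(fonts, "F1", raw))   ' decodes as "Hi"
        Console.WriteLine(Decode(fonts, "F2", raw))   ' same bytes decode as "Ok"
    End Sub
End Module
```

The same two bytes produce different text depending on which font the last Tf selected, which is why appending the string value directly is only reliable for fonts whose encoding happens to match.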
For details, you should start by studying the PDF specification, ISO 32000-1, especially chapter 9, Text, with chapter 8, Graphics, as background.