Question

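I am working with PDFs that contain tables. The main goal is to reproduce a similar table structure in an Excel sheet.

Reading the PDF stream with iTextSharp or PDFsharp, I can get the plain text, but the table structure is lost: the coordinate values that the text elements had in the stream are stripped out of the plain text.

How can I process the stream together with the coordinates so that the text values end up in the exact positions in Excel?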

Answer 1

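I ran into the same problem of importing the tabular part of a PDF into Excel. I did it the following way:

Open the PDF manually, select all and copy
Switch to Excel manually
Start a VBA macro that reads the clipboard, parses the data and writes out the table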

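The problem is that the data in the buffer is not arranged horizontally, as you would expect, but vertically, so I had to develop some code around this. I used a class module to implement functions such as "next word", "next line", "search for word" and so on.

If it helps, I am happy to share this code.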

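Edit

I use MSForms.DataObject to read the clipboard. After creating a reference to the Microsoft Forms 2.0 Object Library (...\system32\FM20.DLL), create a new class module named ClipClass and put the following code in it: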

    Public P As Long    ' pointer to the current position in the buffer
    Public T As String  ' total text buffer
    Public L As String  ' current line

    ' Reset the pointer and return the first line of the buffer
    Public Property Get FirstLine() As String
        P = 1
        FirstLine = NextLine()
    End Property

    ' Return the line starting at P and move P past its CRLF
    Public Property Get NextLine() As String
        L = ""
        ' Also stop at the end of the buffer, so a missing final CRLF cannot loop forever
        Do Until P > Len(T) Or Mid(T, P, 2) = vbCrLf
            L = L & Mid(T, P, 1)
            P = P + 1
        Loop
        NextLine = L
        P = P + 2
    End Property

    ' Scan from the top for a line equal to Arg; returns "" if it is not found
    Public Property Get FindLine(Arg As String) As String
        Dim Tmp As String
        Tmp = FirstLine()
        Do Until Tmp = Arg Or P > Len(T)
            Tmp = NextLine()
        Loop
        If Tmp = Arg Then FindLine = Tmp
    End Property

    Private Sub Class_Initialize()
        Dim Buf As MSForms.DataObject
        Set Buf = New MSForms.DataObject   ' this object interfaces with the clipboard
        Buf.GetFromClipboard               ' copy the clipboard to the object
        T = Buf.GetText                    ' copy the text from the object to a string variable
        L = ""
        P = 1
        Set Buf = Nothing                  ' clean up
    End Sub

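This gives you all the functions needed to find a string and read out lines. Now for the fun part... In my case the PDF contains a constant string that is always 3 lines above the first table cell, and all table cells are arranged column by column in the text buffer. This is the parser, which is called from a button on the Excel worksheet: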

    Sub Parse()
        Dim C As ClipClass, Tmp As String, WS As Range
        Dim WSRow As Integer, WSCol As Integer

        ' initialize
        Set WS = Worksheets("Table").[A1]
        Set C = New ClipClass   ' this creates the class instance and implicitly
                                ' fires its Initialize() code, which grabs the clipboard

        ' get to head of table
        Tmp = C.FindLine("identifying string before table starts")

        ' advance to one line before first table field - each field is terminated by CRLF
        Tmp = C.NextLine
        Tmp = C.NextLine

        ' PDF table is 3 columns x 7 rows, organized column by column
        For WSCol = 1 To 3
            For WSRow = 1 To 7
                WS(WSRow, WSCol) = C.NextLine
            Next WSRow
        Next WSCol
    End Sub
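
The column-by-column layout described above can also be handled by collecting all cell lines first and then transposing them. The following is a minimal Python sketch of that idea, not the code from this answer. The function name `column_major_to_rows`, the marker string and the sample text are made up for illustration, and it assumes, as in the answer, that the marker line sits 3 lines above the first cell and that every cell occupies exactly one CRLF-terminated line.

```python
def column_major_to_rows(text, marker, n_cols, n_rows, skip=2):
    """Turn clipboard text whose table cells are listed column by column
    into a list of rows (row-major), ready to be written to a sheet."""
    lines = text.split("\r\n")
    # The marker is 3 lines above the first cell, so `skip` lines lie in between.
    start = lines.index(marker) + 1 + skip  # raises ValueError if the marker is absent
    cells = lines[start:start + n_cols * n_rows]
    if len(cells) < n_cols * n_rows:
        raise ValueError("not enough lines for the expected table size")
    # Cell (row r, column c) sits at position c * n_rows + r in the flat list.
    return [[cells[c * n_rows + r] for c in range(n_cols)] for r in range(n_rows)]


# Hypothetical clipboard content: a 3-column x 2-row table stored column by column.
sample = "\r\n".join([
    "header", "TABLE START", "skip 1", "skip 2",
    "a1", "a2",   # column 1
    "b1", "b2",   # column 2
    "c1", "c2",   # column 3
    "footer",
]) + "\r\n"

table = column_major_to_rows(sample, "TABLE START", n_cols=3, n_rows=2)
for row in table:
    print(row)
```

Building the whole grid in memory before writing keeps the parsing step separate from the worksheet, which makes it easier to check the table dimensions before any cell is touched.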

Answer 2

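To achieve the same goal, I first read the PDF with iTextSharp (I also tried PDF Clown). The individual chunks, together with their coordinates, were extracted from the PDF. Because the PDFs follow a pattern similar to an invoice file, the data was picked out accordingly by logic, and the resulting Excel file was then generated with the help of NPOI.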
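
This answer does not spell out how the chunks are mapped to cells. One common approach, sketched here in Python as an assumption rather than as the answerer's method, is to group chunks into rows by their y coordinate and then assign columns by x position. The `(x, y, text)` tuples, the tolerance `y_tol` and the column boundaries `col_bounds` are hypothetical stand-ins for whatever the PDF library reports. In PDF coordinates y grows upward, so the top row has the largest y.

```python
from bisect import bisect_right


def chunks_to_grid(chunks, col_bounds, y_tol=2.0):
    """Arrange (x, y, text) chunks into a list of table rows.

    col_bounds: sorted x positions where columns 2, 3, ... begin (assumed known).
    y_tol: a chunk joins the current row if its y is within this distance
           of the first chunk placed in that row.
    """
    rows = []  # each entry: (y of the row's first chunk, [(x, text), ...])
    # Visit chunks from the top of the page to the bottom.
    for x, y, text in sorted(chunks, key=lambda c: -c[1]):
        if rows and abs(rows[-1][0] - y) <= y_tol:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))

    grid = []
    for _, items in rows:
        line = [""] * (len(col_bounds) + 1)
        for x, text in sorted(items):
            col = bisect_right(col_bounds, x)  # index of the column containing x
            line[col] = (line[col] + " " + text).strip()
        grid.append(line)
    return grid


# Hypothetical chunks from an invoice-like page, in arbitrary order.
chunks = [
    (50, 700.0, "Item"), (200, 700.5, "Qty"), (300, 699.8, "Price"),
    (300, 680.0, "9.50"), (50, 680.2, "Widget"), (200, 680.1, "2"),
    (50, 660.0, "Bolt"), (300, 660.0, "0.10"),
]
grid = chunks_to_grid(chunks, col_bounds=[150, 250])
for row in grid:
    print(row)
```

Each inner list corresponds to one worksheet row, so the grid can be written cell by cell with a spreadsheet library such as the NPOI mentioned above. Empty strings mark cells for which no chunk was found.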
