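I am trying to use VBA to extract tables from PDF files and export them to Excel. If everything works as it should, the whole process should run automatically. The problem is that the tables are not standardized.
This is what I have so far.
Code: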
With New Scripting.FileSystemObject
    With .OpenTextFile(strFileName, 1, False, 0)
        If Not .AtEndOfStream Then .SkipLine
        Do Until .AtEndOfStream
            'do something
        Loop
    End With
End With
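This all works great. But now I am stuck on extracting the tables from the text file. What I want is for VBA to find a string, e.g. "Annual Income", and then output the data that follows it into columns (until the table ends).
The first part (finding a certain string) is not very difficult, but how would I go about the second part? The text file looks like this Pastebin. The problem is that the text is not standardized. For example, some tables have three year columns (2010 2011 2012) while others have only two (or one), some tables have more spaces between the columns, and some tables omit certain rows (such as Capital Asset, net).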
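I was thinking of doing something like this, but I am not sure how to go about it in VBA.
My first version was based on Pdf to excel, but people online recommend using FileSystemObject rather than OpenFile (even though it seems to be a lot slower).
Any pointers to get me started, mainly on step 2?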
Answer 0: (score: 2)
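You can dissect a text file in a number of ways, and how complex the file is may push you towards one or another. I started this and it got a bit out of hand... enjoy.
Based on the sample you provided and the other comments, I noted the following. Some of these approaches work well for simple files but can become unwieldy for bigger, more complex ones. There may also be slightly more efficient methods or tricks than the ones I have used, but this will certainly get you to the desired result. Hopefully it makes sense when read together with the code provided: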
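Use InStr on the current line to work out which section of the file you are in: look for the text 'Table' to determine that you are inside a table, and once you know you are in the 'Table' part of the file, start looking for the 'Assets' section, and so on.
To determine the number of years (columns), the Split function together with a loop will do the job.
The following code extracts the Assets details from the text file; you can modify it to extract the other sections. It should handle multiple rows. Hopefully I have commented it well enough. Have a look, and I will edit it if you need further help.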
Sub ReadInTextFile()
    'Requires a reference to Microsoft Scripting Runtime (scrrun.dll)
    Dim fs As Scripting.FileSystemObject, fsFile As Scripting.TextStream
    Dim sFileName As String, sLine As String, vYears As Variant
    Dim iNoColumns As Integer, ii As Integer, iCount As Integer
    Dim bIsTable As Boolean, bIsAssets As Boolean, bIsLiabilities As Boolean, bIsNetAssets As Boolean
    Dim sAssets() As String, iColumns() As Integer

    Set fs = CreateObject("Scripting.FileSystemObject")
    sFileName = "G:\Sample.txt"
    Set fsFile = fs.OpenTextFile(sFileName, 1, False)   '1 = ForReading

    'Loop through the file as you've already done
    Do Until fsFile.AtEndOfStream
        'Determine flag positions in text file
        sLine = fsFile.ReadLine
        Debug.Print VBA.Len(sLine)

        'Always skip empty lines (including single spaces)
        If VBA.Len(sLine) > 1 Then

            'We've found a new table so we can reset the booleans
            If VBA.InStr(1, sLine, "Table") > 0 Then
                bIsTable = True
                bIsAssets = False
                bIsNetAssets = False
                bIsLiabilities = False
                iNoColumns = 0
            End If

            'Perhaps you want to also have some sort of way to designate that a table has finished. Like so
            If VBA.InStr(1, sLine, "Some text that designates the end of the table") > 0 Then
                bIsTable = False
            End If

            'If we're in the table section then we want to read in the data
            If bIsTable Then
                'Check for your different sections. You could make this constant if your text file allowed it.
                If VBA.InStr(1, sLine, "Assets") > 0 And VBA.InStr(1, sLine, "Net") = 0 Then bIsAssets = True: bIsLiabilities = False: bIsNetAssets = False
                If VBA.InStr(1, sLine, "Liabilities") > 0 Then bIsAssets = False: bIsLiabilities = True: bIsNetAssets = False
                'Net Assets is its own section, so the Assets flag is switched off here
                If VBA.InStr(1, sLine, "Net Assets") > 0 Then bIsAssets = False: bIsLiabilities = False: bIsNetAssets = True

                'If we haven't triggered any of these booleans then we're at the column headings
                If Not bIsAssets And Not bIsLiabilities And Not bIsNetAssets And VBA.InStr(1, sLine, "Table") = 0 Then
                    'Trim the current line to remove leading and trailing spaces then use the split function to determine the number of years
                    vYears = VBA.Split(VBA.Trim$(sLine), " ")
                    For ii = LBound(vYears) To UBound(vYears)
                        If VBA.Len(vYears(ii)) > 0 Then iNoColumns = iNoColumns + 1
                    Next ii

                    'Now we can redefine some variables to hold the information (you'll want to redim after you've collected the info)
                    'The guard avoids an invalid ReDim when the line held no headings
                    If iNoColumns > 0 Then
                        ReDim sAssets(1 To iNoColumns + 1, 1 To 100)
                        ReDim iColumns(1 To iNoColumns)
                    End If
                Else
                    If bIsAssets Then
                        'Skip the heading line
                        If VBA.Trim$(sLine) <> "Assets" Then
                            'Increment the counter
                            iCount = iCount + 1

                            'If iCount reaches its limit you'll have to ReDim Preserve your sAssets array (I'll leave this to you)
                            If iCount > 99 Then
                                'You'll find other posts on stackoverflow to do this
                            End If

                            'This will happen on the first row, it'll happen every time you
                            'hit a $ sign but you could code to only do so the first time
                            If VBA.InStr(1, sLine, "$") > 0 Then
                                iColumns(1) = VBA.InStr(1, sLine, "$")
                                For ii = 2 To iNoColumns
                                    'We need to start at the next character across
                                    iColumns(ii) = VBA.InStr(iColumns(ii - 1) + 1, sLine, "$")
                                Next ii
                            End If

                            'The first part (the name) is simply up to the $ sign (trimmed of spaces)
                            sAssets(1, iCount) = VBA.Trim$(VBA.Mid$(sLine, 1, iColumns(1) - 1))
                            For ii = 2 To iNoColumns
                                'Then we can loop around for the rest: the text between
                                'the previous $ position and the current one
                                sAssets(ii, iCount) = VBA.Trim$(VBA.Mid$(sLine, iColumns(ii - 1) + 1, iColumns(ii) - iColumns(ii - 1) - 1))
                            Next ii

                            'Now do the last column
                            If VBA.Len(sLine) > iColumns(iNoColumns) Then
                                sAssets(iNoColumns + 1, iCount) = VBA.Trim$(VBA.Right$(sLine, VBA.Len(sLine) - iColumns(iNoColumns)))
                            End If
                        Else
                            'Reset the counter
                            iCount = 0
                        End If
                    End If
                End If
            End If
        End If
    Loop

    'Clean up
    fsFile.Close
    Set fsFile = Nothing
    Set fs = Nothing
End Sub
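The listing above leaves growing the sAssets array as an exercise. The following is a minimal sketch of one way to do it, assuming the array keeps the (column, row) layout used above. ReDim Preserve can only resize the last dimension of an array, which is why it matters that the row index comes second. The helper name EnsureRowCapacity and the block size of 100 are illustrative choices, not part of the original answer.

```vba
Option Explicit

' Hypothetical helper: grows the row (last) dimension of a 2-D string array
' in blocks of lBlockSize whenever lRowNeeded exceeds the current upper bound.
Sub EnsureRowCapacity(ByRef sData() As String, ByVal lRowNeeded As Long, _
                      Optional ByVal lBlockSize As Long = 100)
    Dim lNewUpper As Long

    If lRowNeeded > UBound(sData, 2) Then
        lNewUpper = UBound(sData, 2)
        Do While lNewUpper < lRowNeeded
            lNewUpper = lNewUpper + lBlockSize
        Loop
        ' Preserve keeps the existing contents; only the upper bound of the
        ' last dimension changes, all other bounds are restated unchanged
        ReDim Preserve sData(LBound(sData, 1) To UBound(sData, 1), _
                             LBound(sData, 2) To lNewUpper)
    End If
End Sub

' Small demonstration with a stand-in array of 3 columns and 100 rows
Sub DemoEnsureRowCapacity()
    Dim sDemo() As String

    ReDim sDemo(1 To 3, 1 To 100)
    sDemo(1, 100) = "kept"

    EnsureRowCapacity sDemo, 101        ' row 101 does not fit yet, so the array grows
    Debug.Print UBound(sDemo, 2), sDemo(1, 100)
End Sub
```

In the listing, a call such as EnsureRowCapacity sAssets, iCount could take the place of the empty If iCount > 99 block, provided sAssets is a dynamic array.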
Answer 1: (score: 0)
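I could not check the sample data because the PasteBin has been removed. Based on what I can gather from the problem description, it seems to me that the data would be easier to parse with regular expressions.
Add a reference to the Microsoft Scripting Runtime (scrrun.dll) for the FileSystemObject. Add a reference to Microsoft VBScript Regular Expressions 5.5, the library for the RegExp object.
Instantiate a RegExp object using Dim objRE As New RegExp
Set the Pattern property to "(\b\d{4}\b){1,3}". This pattern should match lines that contain strings such as: "2010", "2010 2011", or "2010 2011 2012".
The number of spaces between the year strings is irrelevant, as long as there is at least one (since we do not expect to encounter a string like 201020112012).
Set the Global property to True.
The captured groups will be found in the individual Match objects of the MatchCollection returned by the Execute method of the RegExp object objRE. So declare the appropriate objects: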
Dim objMatches As MatchCollection
Dim objMatch As Match
Dim intMatchCount As Long 'tells you how many year strings were found, if any
Dim i As Long 'loop index over the matches
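Assuming you have set up the FileSystemObject and are scanning through the text file, read each line into a variable strLine.
First test whether the current line contains the pattern you are looking for: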
If objRE.Test(strLine) Then
    'do something
Else
    'skip over this line
End If
Set objMatches = objRE.Execute(strLine)
intMatchCount = objMatches.Count
For i = 0 To intMatchCount - 1
    'processing code such as writing the years as column headings in Excel
    Set objMatch = objMatches(i)
    'e.g. ActiveCell.Value = objMatch.Value
    'subsequent lines beneath the line containing the year strings should
    'have the amounts, which may be captured in a similar fashion using an
    'additional RegExp object and a Pattern such as "(\b\d+\b){1,3}" for
    'whole numbers or "(\b\d+\.\d+\b){1,3}" for floats. For currency, you
    'can use "(\$\d+\.\d{2}\b){1,3}" (no \b before \$, because "$" is not
    'a word character and \b would not match between a space and "$")
Next i
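The fragments above can be pulled together into one self-contained routine. This is only a sketch under a few assumptions: it uses late binding through CreateObject("VBScript.RegExp") instead of the library reference described earlier, the names ExtractMatches and DemoExtractYears are made up for illustration, and the pattern is simplified to "\b\d{4}\b" because with Global set to True each year comes back as its own match anyway.

```vba
Option Explicit

' Returns every substring of strLine that matches strPattern, in order.
' Late binding is used here, so no library reference has to be added.
Function ExtractMatches(ByVal strLine As String, ByVal strPattern As String) As Collection
    Dim objRE As Object, objMatch As Object
    Dim colResult As New Collection

    Set objRE = CreateObject("VBScript.RegExp")
    objRE.Pattern = strPattern
    objRE.Global = True          ' find all matches, not just the first

    For Each objMatch In objRE.Execute(strLine)
        colResult.Add objMatch.Value
    Next objMatch

    Set ExtractMatches = colResult
End Function

' Illustrative use: write the years found on a heading line as column headings
Sub DemoExtractYears()
    Dim colYears As Collection, i As Long

    ' The sample line is invented; spacing between the years does not matter
    Set colYears = ExtractMatches("                  2010     2011  2012", "\b\d{4}\b")

    For i = 1 To colYears.Count
        ' Column A is left free for the row labels, so headings start in column B
        ActiveSheet.Cells(1, i + 1).Value = colYears(i)
    Next i
End Sub
```

The same function could be reused for the amount rows by passing a different pattern, for example "\d[\d,]*" for whole numbers with thousands separators; which pattern fits depends on how the amounts are actually formatted in the text file.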
This is just a rough outline of how I would approach this challenge. I hope something in this code outline is of help to you.
Answer 2: (score: 0)
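Another approach that works is to use VBA to convert the PDF to a .doc or .docx file, and then search for and extract the tables from the Word file. From there they can easily be extracted into Excel sheets. The conversion seems to handle tables nicely. Note, however, that it works on a page-by-page basis, so a table that extends across pages ends up as separate tables in the Word document.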