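I built a simple web scraper in VBA that pulls a table from Google Patents and stores the innerHTML in a .txt file (named like 1234567.txt) for about 23,000 patents. Now I want to analyze the contents of each file. To do that, I want to import the .txt files into VBA so I can run some string searches, but this seems hard to achieve. I have read about 20 solutions for importing .txt files into VBA, but none of them work for my files, which look like this: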
US6824791 B2 ' There is a shift + Enter here
<TD class="patent-data-table-td citation-patent"><A href="/patents/US7767249">US7767249</A></TD>
<TD class="patent-data-table-td patent-date-value">Jul 25, 2005</TD>
<TD class="patent-data-table-td patent-date-value">Aug 3, 2010</TD>
<TD class="patent-data-table-td ">Hewlett-Packard Development Company, L.P.</TD>
<TD class="patent-data-table-td ">Preparation of nanoparticles</TD></TR>
<TR>
<TD class="patent-data-table-td citation-patent"><A href="/patents/US7935853">US7935853</A><SPAN class=patent-tooltip-anchor aria-label="Cited by examiner" data-tooltip-text="Cited by examiner" data-tooltip="Cited by examiner" a="null"> *</SPAN></TD>
<TD class="patent-data-table-td patent-date-value">Oct 8, 2009</TD>
<TD class="patent-data-table-td patent-date-value">May 3, 2011</TD>
<TD class="patent-data-table-td ">Bobelium S.L.</TD>
<TD class="patent-data-table-td ">Micronized composition of a 2,4-disubstituted phenol derivative</TD></TR>
<TR>
<TD class="patent-data-table-td citation-patent"><A href="/patents/US8524829">US8524829</A></TD>
<TD class="patent-data-table-td patent-date-value">Jun 17, 2008</TD>
<TD class="patent-data-table-td patent-date-value">Sep 3, 2013</TD>
<TD class="patent-data-table-td ">Brown University Research Foundation</TD>
<TD class="patent-data-table-td ">Methods for micronization of hydrophobic drugs</TD></TR>
<TR>
<TD class="patent-data-table-td citation-patent"><A href="/patents/EP2422804A1?cl=en">EP2422804A1</A></TD>
<TD class="patent-data-table-td patent-date-value">Jun 16, 2005</TD>
<TD class="patent-data-table-td patent-date-value">Feb 29, 2012</TD>
<TD class="patent-data-table-td ">Amano Enzyme USA., Ltd.</TD>
<TD class="patent-data-table-td ">Controlled release formulations of enzymes, microorganisms, and antibodies with mucoadhesive polymers</TD></TR></TBODY></TABLE>
' There is a shift + Enter here
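So although the files are neat and repetitive, importing them as a single string seems very difficult. I basically want to loop through the files, extract the patent number and the 2 dates mentioned (using Mid and InStr), and put them in three different columns. This is the best approach I can think of, but I am eager to hear smarter suggestions!
It may be relevant that each .txt file has a different length (number of lines), but I can estimate the exact number of lines with 99.9% certainty: 1 (title) + 6 * total number of citations (which I know) - 1 (the last <TR> is missing).
Thanks in advance,
Simon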
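Edit: some examples of what I have tried. This code is taken from online sources. I am not sure how it is supposed to work, and I may well be applying it incorrectly.
Sub Text2Excel_Click()
Dim sourcestring As String
sourcestring = GetText("C\users\...\test.txt")
sourcestring = OpenTextFileToString("C\users\...\test.txt")
End Sub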
Function GetText(sFile As String) As String
Dim sText As String
Dim nSourceFile As Integer
''Close any open text files
Close
''Get the number of the next free text file
nSourceFile = FreeFile
''Write the entire file to sText
Open sFile For Input As #nSourceFile
sText = Input$(LOF(1), 1)
Close
GetText = sText
End Function
Function OpenTextFileToString(ByVal strFile As String) As String
' RB Smissaert - Author
Dim hFile As Long
hFile = FreeFile
Open strFile For Input As #hFile
OpenTextFileToString = Input$(LOF(hFile), hFile)
Close #hFile
End Function
I also tried the following suggestion, but it gave me run-time error 424, "Object required".
Answer 0 (score: 1)
If you just want to read a text file into a single string in VBA, you can use the function below.
Note: this requires adding a reference to Microsoft Scripting Runtime.
Public Function ReadFileAsString(ByVal filePath As String) As String
    Dim fso As FileSystemObject
    Dim txtstream As TextStream
    Dim sourceString As String
    Set fso = New FileSystemObject
    ' Open the file read-only; False means do not create it if it is missing
    Set txtstream = fso.OpenTextFile(filePath, ForReading, False)
    ' Append each line to the result; note that ReadLine drops the line breaks
    Do While Not txtstream.AtEndOfStream
        sourceString = sourceString & txtstream.ReadLine
    Loop
    ReadFileAsString = sourceString
    txtstream.Close
    Set fso = Nothing
    Set txtstream = Nothing
End Function
You can then use InStr(), Mid() and so on against the string returned by the function above. To view the text file, try:
Public Sub Test()
Debug.Print ReadFileAsString("C:\Users\ausername\Desktop\test.txt")
End Sub
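As a rough illustration of the InStr()/Mid() step, here is a minimal sketch of pulling the cited patent number and the two dates out of such a string. The names NextBetween, ExtractCitations and DemoExtract are hypothetical and not part of the original answer. The sketch assumes the markup always looks like the sample in the question (upper-case tags, a "citation-patent" cell followed by two "patent-date-value" cells) and uses the default case-sensitive comparison.

```vba
' Hypothetical helper: returns the text between startTag and endTag,
' searching from position pos, and advances pos past endTag.
' Sets pos = 0 and returns "" when either tag is not found.
Private Function NextBetween(ByVal s As String, ByRef pos As Long, _
                             ByVal startTag As String, ByVal endTag As String) As String
    Dim a As Long, b As Long
    a = InStr(pos, s, startTag)
    If a = 0 Then
        pos = 0
        Exit Function
    End If
    a = a + Len(startTag)
    b = InStr(a, s, endTag)
    If b = 0 Then
        pos = 0
        Exit Function
    End If
    NextBetween = Mid$(s, a, b - a)
    pos = b + Len(endTag)
End Function

' Hypothetical routine: writes one row per citation (patent number, date, date)
' to ws, starting at nextRow, and leaves nextRow pointing at the next free row.
Public Sub ExtractCitations(ByVal html As String, ByVal ws As Worksheet, ByRef nextRow As Long)
    Dim pos As Long, p As Long
    Dim patentNo As String, date1 As String, date2 As String
    pos = 1
    Do
        ' Find the next citation cell, then the link inside it
        p = InStr(pos, html, "citation-patent")
        If p = 0 Then Exit Do
        pos = InStr(p, html, "<A ")
        If pos = 0 Then Exit Do
        ' The link text sits between the end of the <A ...> tag and </A>
        patentNo = NextBetween(html, pos, ">", "</A>")
        If pos = 0 Then Exit Do
        ' The two date cells follow the citation cell
        date1 = NextBetween(html, pos, "patent-date-value"">", "</TD>")
        If pos = 0 Then Exit Do
        date2 = NextBetween(html, pos, "patent-date-value"">", "</TD>")
        If pos = 0 Then Exit Do
        ws.Cells(nextRow, 1).Value = patentNo
        ws.Cells(nextRow, 2).Value = date1
        ws.Cells(nextRow, 3).Value = date2
        nextRow = nextRow + 1
    Loop
End Sub

' Small demo using one citation copied from the sample file in the question
Public Sub DemoExtract()
    Dim sample As String, r As Long
    sample = "<TD class=""patent-data-table-td citation-patent"">" & _
             "<A href=""/patents/US7767249"">US7767249</A></TD>" & _
             "<TD class=""patent-data-table-td patent-date-value"">Jul 25, 2005</TD>" & _
             "<TD class=""patent-data-table-td patent-date-value"">Aug 3, 2010</TD>"
    r = 1
    ExtractCitations sample, ActiveSheet, r
End Sub
```

For the real files, the string passed to ExtractCitations would come from ReadFileAsString, called once per file (for example inside a Dir() loop over the folder that holds the .txt files). Excel may convert the date text into date values when it is written to the cells, so format the columns as text first if the original strings are needed.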