How do I import a text file with no columns and hard line breaks into VBA?

Time: 2014-05-27 16:49:15

Tags: vba excel-vba import excel

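I built a simple web scraper in VBA that pulls tables from Google Patents and stores the innerHTML in .txt files (with names like 1234567.txt) for about 23,000 patents. Now I want to analyze the contents of each file. To do that, I would like to import the txt files into VBA so that I can run some string searches, but this turns out to be hard to do. I have read about 20 solutions for importing .txt files into VBA, but none of them work for my files, which look like this: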

US6824791 B2 ' There is a shift + Enter here
<TD class="patent-data-table-td citation-patent"><A href="/patents/US7767249">US7767249</A></TD>
<TD class="patent-data-table-td patent-date-value">Jul 25, 2005</TD>
<TD class="patent-data-table-td patent-date-value">Aug 3, 2010</TD>
<TD class="patent-data-table-td ">Hewlett-Packard Development Company, L.P.</TD>
<TD class="patent-data-table-td ">Preparation of nanoparticles</TD></TR>
<TR>
<TD class="patent-data-table-td citation-patent"><A href="/patents/US7935853">US7935853</A><SPAN class=patent-tooltip-anchor aria-label="Cited by examiner" data-tooltip-text="Cited by examiner" data-tooltip="Cited by examiner" a="null"> *</SPAN></TD>
<TD class="patent-data-table-td patent-date-value">Oct 8, 2009</TD>
<TD class="patent-data-table-td patent-date-value">May 3, 2011</TD>
<TD class="patent-data-table-td ">Bobelium S.L.</TD>
<TD class="patent-data-table-td ">Micronized composition of a 2,4-disubstituted phenol derivative</TD></TR>
<TR>
<TD class="patent-data-table-td citation-patent"><A href="/patents/US8524829">US8524829</A></TD>
<TD class="patent-data-table-td patent-date-value">Jun 17, 2008</TD>
<TD class="patent-data-table-td patent-date-value">Sep 3, 2013</TD>
<TD class="patent-data-table-td ">Brown University Research Foundation</TD>
<TD class="patent-data-table-td ">Methods for micronization of hydrophobic drugs</TD></TR>
<TR>
<TD class="patent-data-table-td citation-patent"><A href="/patents/EP2422804A1?cl=en">EP2422804A1</A></TD>
<TD class="patent-data-table-td patent-date-value">Jun 16, 2005</TD>
<TD class="patent-data-table-td patent-date-value">Feb 29, 2012</TD>
<TD class="patent-data-table-td ">Amano Enzyme USA., Ltd.</TD>
<TD class="patent-data-table-td ">Controlled release formulations of enzymes, microorganisms, and antibodies with mucoadhesive polymers</TD></TR></TBODY></TABLE>
' There is a shift + Enter here

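So although the files are tidy and repetitive, importing them as a single string seems very difficult. I basically want to loop through the files, extract the patent number and the two dates mentioned (using Mid and InStr), and put them in three separate columns. This is the best approach I can think of, but I am eager to hear smarter suggestions!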

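It may be relevant that each .txt file has a different length (number of lines), but I can estimate the exact number of lines with 99.9% certainty: 1 (title) + 6 * total number of citations (which I know) - 1 (the last <TR> is missing).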
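The line-count estimate above can be written as a one-line helper. This is only a sketch of that arithmetic; the name `ExpectedLineCount` and its parameter are hypothetical and do not come from the original code.

```vba
' Hypothetical helper: expected number of lines in one scraped file.
Public Function ExpectedLineCount(ByVal citationCount As Long) As Long
    ' 1 title line + 6 lines per citation - 1 for the missing final <TR>
    ExpectedLineCount = 1 + 6 * citationCount - 1
End Function
```

For the sample file shown above, which lists four citations, `ExpectedLineCount(4)` returns 24: the title line plus the 23 lines of table markup.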

Thanks in advance,

Simon

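Edit: some of the things I have tried. This code is taken from online sources. I am not sure how these functions are supposed to work, and I may well be applying them incorrectly.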

Sub Text2Excel_Click()
    Dim sourcestring As String

    ' Two alternative ways of reading the whole file into one string
    sourcestring = GetText("C:\users\...\test.txt")
    sourcestring = OpenTextFileToString("C:\users\...\test.txt")
End Sub

Function GetText(sFile As String) As String
   Dim sText As String
   Dim nSourceFile As Integer

   ''Close any open text files
   Close
   ''Get the number of the next free text file
   nSourceFile = FreeFile
   ''Read the entire file into sText, using the file number just obtained
   Open sFile For Input As #nSourceFile
   sText = Input$(LOF(nSourceFile), nSourceFile)
   Close #nSourceFile

   GetText = sText
End Function

Function OpenTextFileToString(ByVal strFile As String) As String
' RB Smissaert - Author
Dim hFile As Long
hFile = FreeFile
Open strFile For Input As #hFile
OpenTextFileToString = Input$(LOF(hFile), hFile)
Close #hFile
End Function

I also tried the following suggestion, but it gave me run-time error 424 "Object required".

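1 answer:

Answer 0 (score: 1)

If you just want to read a text file into a single string in VBA, you can use the following function.

Note: this requires you to add a reference to Microsoft Scripting Runtime.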

Public Function ReadFileAsString(ByVal filePath As String) As String

    Dim fso As FileSystemObject
    Dim txtstream As TextStream
    Dim sourceString As String

    Set fso = New FileSystemObject
    Set txtstream = fso.OpenTextFile(filePath, ForReading, False)

    ' Append each line; ReadLine strips the line break, so the result
    ' is one continuous string with no line separators.
    Do While Not txtstream.AtEndOfStream
        sourceString = sourceString & txtstream.ReadLine
    Loop

    txtstream.Close
    ReadFileAsString = sourceString

    Set txtstream = Nothing
    Set fso = Nothing

End Function
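If adding the Microsoft Scripting Runtime reference is inconvenient, a late-bound variant is one possible alternative. The sketch below is not part of the original answer: the name `ReadWholeFile` is hypothetical, and it uses `ReadAll`, so unlike the loop above it keeps the line breaks in the returned string.

```vba
' Hypothetical late-bound variant: no project reference is needed.
Public Function ReadWholeFile(ByVal filePath As String) As String
    Dim fso As Object
    Dim stream As Object

    Set fso = CreateObject("Scripting.FileSystemObject")
    ' 1 = ForReading; the literal is used because late binding
    ' does not expose the Scripting constant names.
    Set stream = fso.OpenTextFile(filePath, 1, False)

    ' ReadAll raises an error on an empty file, so check first.
    If Not stream.AtEndOfStream Then
        ReadWholeFile = stream.ReadAll
    End If
    stream.Close
End Function
```

Keeping the line breaks makes no difference to `InStr` searches for the table tags, but it does matter if the first line (the patent title line) has to be separated from the markup that follows.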

You can then use InStr(), Mid(), and so on with the string returned by ReadFileAsString. To view the contents of the text file, try:

Public Sub Test()
    Debug.Print ReadFileAsString("C:\Users\ausername\Desktop\test.txt")
End Sub
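As a rough illustration of the InStr()/Mid() step, the sketch below pulls each cited patent number and its two dates out of the string and writes them to columns A to C. It assumes the markup looks exactly like the sample in the question (upper-case tags, `citation-patent` and `patent-date-value` class names). The names `NextBetween`, `ExtractCitations`, and `ExtractAll` are hypothetical, and `folderPath` is assumed to end with a backslash.

```vba
' Returns the text between startTag and endTag, searching from pos.
' On success pos moves just past endTag; on failure pos is set to 0.
Private Function NextBetween(ByVal src As String, ByVal startTag As String, _
                             ByVal endTag As String, ByRef pos As Long) As String
    Dim startPos As Long
    Dim endPos As Long

    startPos = InStr(pos, src, startTag)
    If startPos = 0 Then
        pos = 0
        Exit Function
    End If
    startPos = startPos + Len(startTag)

    endPos = InStr(startPos, src, endTag)
    If endPos = 0 Then
        pos = 0
        Exit Function
    End If

    NextBetween = Mid$(src, startPos, endPos - startPos)
    pos = endPos + Len(endTag)
End Function

' Writes one row per citation: patent number, first date, second date.
Public Sub ExtractCitations(ByVal src As String, ByVal ws As Worksheet)
    Dim pos As Long
    Dim rowIndex As Long
    Dim anchor As String
    Dim firstDate As String
    Dim secondDate As String

    ' Start below the last used cell in column A
    rowIndex = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row + 1
    pos = 1
    Do
        pos = InStr(pos, src, "citation-patent")
        If pos = 0 Then Exit Do

        ' anchor holds e.g. "/patents/US7767249">US7767249
        anchor = NextBetween(src, "<A href=", "</A>", pos)
        If pos = 0 Then Exit Do
        firstDate = NextBetween(src, "patent-date-value"">", "</TD>", pos)
        If pos = 0 Then Exit Do
        secondDate = NextBetween(src, "patent-date-value"">", "</TD>", pos)
        If pos = 0 Then Exit Do

        ' The link text after the last ">" is the patent number
        ws.Cells(rowIndex, 1).Value = Mid$(anchor, InStrRev(anchor, ">") + 1)
        ws.Cells(rowIndex, 2).Value = firstDate
        ws.Cells(rowIndex, 3).Value = secondDate
        rowIndex = rowIndex + 1
    Loop
End Sub

' Runs the extraction over every .txt file in a folder.
Public Sub ExtractAll(ByVal folderPath As String)
    Dim fileName As String

    fileName = Dir(folderPath & "*.txt")
    Do While Len(fileName) > 0
        ExtractCitations ReadFileAsString(folderPath & fileName), ActiveSheet
        fileName = Dir
    Loop
End Sub
```

Taking the link text instead of the `href` value avoids having to strip suffixes such as `?cl=en`, which appear on some entries in the sample. Excel may convert values like "Jul 25, 2005" to dates when they are written to cells; format the columns as text first if the original strings are needed.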