I'm having trouble scraping some HTML.
Here is the URL my macro is scraping, followed by an excerpt of the code:
Set els = IE.Document.getelementsbytagname("a")
For Each el In els
    If Trim(el.innertext) = "Documents" Then
        colDocLinks.Add el.href
    End If
Next el
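As you can see, if you open the URL, you land on a page of search results. The macro then finds all of the "Documents" links in the search table and puts them into a Collection named colDocLinks.
However, the search results table lists the 10-Q files that I want to include, but it also lists other types of filings that I don't want to include...
How do I modify the loop so that it explicitly adds only 10-Q, and nothing else, such as 10-Q/A, is appended to the collection?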
Answer 0 (score: 1)
' Requires references to Microsoft Internet Controls and Microsoft HTML Object Library.
' WithEvents is only allowed in a class or object module (for example ThisWorkbook).
Public WithEvents objIE As InternetExplorer

Sub LaunchIE()
    Set objIE = New InternetExplorer
    objIE.Visible = True
    objIE.Navigate "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=icld&type=10-Q%20&dateb=&owner=exclude&count=20"
End Sub

' Fires when the document (or one of its frames) has finished loading
Private Sub objIE_DocumentComplete(ByVal pDisp As Object, URL As Variant)
    Dim localIE As InternetExplorer
    Set localIE = pDisp
    Dim doc As MSHTML.IHTMLDocument3
    Set doc = localIE.Document
    Dim tdElements As MSHTML.IHTMLElementCollection
    Dim td As MSHTML.IHTMLElement
    Set tdElements = doc.getElementsByTagName("td")
    For Each td In tdElements
        ' Exact comparison, so cells such as "10-Q/A" are skipped
        If td.innerText = "10-Q" Then
            Dim tr As MSHTML.IHTMLElement
            Set tr = td.parentElement
            Dim childrenElements As MSHTML.IHTMLElementCollection
            Dim child As MSHTML.IHTMLElement
            Set childrenElements = tr.Children
            For Each child In childrenElements
                ' The cell text may start with a space or non-breaking space,
                ' so look for the word instead of comparing to " Documents"
                If InStr(child.innerText, "Documents") > 0 Then
                    'Handle found element
                End If
            Next
        End If
    Next
End Sub
Answer 1 (score: 0)
I would use a regular expression to find and extract the exact links I'm looking for. Something like this:
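Dim RegEx As RegExp
Set RegEx = New RegExp
Dim match As Match
With RegEx
    .IgnoreCase = True
    .Global = True
    .MultiLine = True
End With
' Quotes inside a VBA string literal are doubled; [\s\S] also spans line breaks, which "." does not
RegEx.Pattern = "<td nowrap=""nowrap"">10-Q</td>[\s\S]+?<a href=""(.+?)\.htm"">"
' Selection is assumed to hold the page's HTML source
For Each match In RegEx.Execute(Selection)
    ' SubMatches(0) is the captured part of the link, without the .htm extension
    colDocLinks.Add match.SubMatches(0) & ".htm"
Next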
I haven't tested the regular expression above, so it may need some tweaking. For this to work, you need to add a reference to Microsoft VBScript Regular Expressions 5.5.
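To make the regex idea a bit more concrete, here is a minimal sketch that can be run on its own in the VBA editor. It is not tested against the live search page. The function name ExtractExact10QLinks, the demo procedure and the sample markup are all made up for illustration, and the pattern is only a guess at how the rows are laid out. It creates the RegExp object with late binding, so this particular sketch does not need the library reference.

```vba
Option Explicit

' Hypothetical helper (not from the original post): returns the href that
' follows each table cell whose text is exactly "10-Q". A cell such as
' "10-Q/A" does not match, because the pattern expects </td> right after
' the form type.
Function ExtractExact10QLinks(ByVal html As String) As Collection
    Dim links As New Collection
    Dim re As Object
    Dim m As Object

    ' Late binding: no reference to the VBScript regex library is required
    Set re = CreateObject("VBScript.RegExp")
    re.IgnoreCase = True
    re.Global = True
    ' [\s\S] is used instead of "." because "." does not match line breaks
    re.Pattern = "<td[^>]*>\s*10-Q\s*</td>[\s\S]+?<a\s[^>]*href=""([^""]+)"""

    For Each m In re.Execute(html)
        ' Store the captured href string, not the Match object itself
        links.Add m.SubMatches(0)
    Next m

    Set ExtractExact10QLinks = links
End Function

Sub DemoExtractExact10QLinks()
    ' Made-up sample markup, loosely modeled on a search results table
    Dim html As String
    html = "<tr><td nowrap=""nowrap"">10-Q/A</td><td><a href=""/a/amended-index.htm"">Documents</a></td></tr>" & vbCrLf & _
           "<tr><td nowrap=""nowrap"">10-Q</td><td><a href=""/a/quarterly-index.htm"">Documents</a></td></tr>"

    Dim link As Variant
    For Each link In ExtractExact10QLinks(html)
        Debug.Print link
    Next link
End Sub
```

With the sample above, only the link from the plain 10-Q row should end up in the collection. On the real page you would pass in the document's HTML (for example IE.Document.body.innerHTML) and probably need to adjust the pattern to the actual markup.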