Question

I'm having trouble scraping some HTML.

Here is an excerpt of the macro code that scrapes the URL:

Set els = IE.Document.getElementsByTagName("a")
For Each el In els
    If Trim(el.innerText) = "Documents" Then
        colDocLinks.Add el.href
    End If
Next el

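As you can see if you open the URL, we are presented with search results; the macro then finds all the links in the search table and puts them into a Collection named colDocLinks.

However, the search results table lists the 10-Q filings that I want to include, but it also lists other filing types that I do not want to include, such as 10-Q/A.

How can I modify the loop so that it explicitly adds only the 10-Q filings, and nothing else (such as 10-Q/A) is appended to the collection?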

Answer 1

' Requires references to "Microsoft Internet Controls" and
' "Microsoft HTML Object Library". WithEvents is only allowed in an
' object module (a class, ThisWorkbook, a sheet, or a form).
Public WithEvents objIE As InternetExplorer

Sub LaunchIE()
    Set objIE = New InternetExplorer

    objIE.Visible = True
    objIE.Navigate "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=icld&type=10-Q%20&dateb=&owner=exclude&count=20"
End Sub

' Fires when the page (or a frame in it) has finished loading.
Private Sub objIE_DocumentComplete(ByVal pDisp As Object, URL As Variant)

    Dim localIE As InternetExplorer
    Set localIE = pDisp

    Dim doc As MSHTML.IHTMLDocument3
    Set doc = localIE.Document

    Dim tdElements As MSHTML.IHTMLElementCollection
    Dim td As MSHTML.IHTMLElement
    Set tdElements = doc.getElementsByTagName("td")

    For Each td In tdElements

        ' Exact match, so cells such as "10-Q/A" are skipped.
        If Trim(td.innerText) = "10-Q" Then

            ' The parent of the cell is its table row.
            Dim tr As MSHTML.IHTMLElement
            Set tr = td.parentElement

            Dim childrenElements As MSHTML.IHTMLElementCollection
            Dim child As MSHTML.IHTMLElement
            Set childrenElements = tr.Children

            For Each child In childrenElements
                ' InStr tolerates the whitespace that precedes the link text.
                If InStr(child.innerText, "Documents") > 0 Then
                    'Handle found element
                End If
            Next child

        End If

    Next td

End Sub
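
One way the 'Handle found element' placeholder might be filled in is to pull the link out of the matching row and add it to the question's colDocLinks collection. The following is a minimal sketch, not tested against the live page. The helper name AddDocumentsLinks is hypothetical, and it takes the row as a late-bound Object so that getElementsByTagName can be called on it.

```vba
' Hypothetical helper (not part of the original answer): adds the href of
' every <a> inside the given table row whose visible text contains
' "Documents" to the target collection.
Public Sub AddDocumentsLinks(ByVal tableRow As Object, ByVal target As Collection)
    Dim anchors As Object
    Dim anchor As Object

    ' Late binding: the row is treated as a generic HTML element.
    Set anchors = tableRow.getElementsByTagName("a")

    For Each anchor In anchors
        ' InStr rather than "=" so leading whitespace does not matter.
        If InStr(1, anchor.innerText, "Documents", vbTextCompare) > 0 Then
            target.Add anchor.href
        End If
    Next anchor
End Sub
```

With this in place, the inner loop over the row's children could be replaced by a single call such as AddDocumentsLinks tr, colDocLinks, assuming colDocLinks is visible from the event handler.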

Answer 2

I would use a regular expression to find and extract the exact links I'm looking for. Something like this:

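Dim RegEx As RegExp
Set RegEx = New RegExp
Dim m As Match

With RegEx
    .IgnoreCase = True
    .Global = True
    .MultiLine = True
End With

' Quotes inside a VBA string literal are doubled. [\s\S] is used instead of "."
' because "." does not match line breaks in VBScript regular expressions.
RegEx.Pattern = "<td nowrap=""nowrap"">10-Q</td>[\s\S]+?<a href=""(.+?)\.htm"">"

' Selection stands for the HTML text being searched.
For Each m In RegEx.Execute(Selection)
    ' SubMatches(0) is the captured part of the href; ".htm" sits outside the group.
    colDocLinks.Add m.SubMatches(0) & ".htm"
Next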

I haven't tested the regular expression above, so it may need some tweaking. For this to work, you need to add a reference to Microsoft VBScript Regular Expressions 5.5.
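
To see how an exact "10-Q" cell match leaves out "10-Q/A" rows, the idea can be tried on a small string before pointing it at the real page. The sketch below is an illustration under assumptions: the function name ExtractTenQLinks, the sample HTML, and its paths are all made up, and the pattern is a slight variation (the capture group includes ".htm" and the trailing ">" is dropped). It uses late binding via CreateObject, so no library reference is needed.

```vba
' Hypothetical helper: returns the hrefs that follow a cell containing exactly "10-Q".
Public Function ExtractTenQLinks(ByVal html As String) As Collection
    Dim re As Object
    Dim m As Object
    Dim result As New Collection

    Set re = CreateObject("VBScript.RegExp")   ' late binding
    re.IgnoreCase = True
    re.Global = True
    ' "</td>" must follow "10-Q" directly, so "10-Q/A" cells do not match.
    re.Pattern = "<td nowrap=""nowrap"">10-Q</td>[\s\S]+?<a href=""(.+?\.htm)"""

    For Each m In re.Execute(html)
        result.Add m.SubMatches(0)
    Next m

    Set ExtractTenQLinks = result
End Function

Public Sub DemoExtractTenQLinks()
    ' Made-up fragment with one 10-Q/A row and one 10-Q row.
    Dim html As String
    html = "<tr><td nowrap=""nowrap"">10-Q/A</td>" & vbLf & _
           "<td><a href=""/docs/amended-index.htm"">Documents</a></td></tr>" & vbLf & _
           "<tr><td nowrap=""nowrap"">10-Q</td>" & vbLf & _
           "<td><a href=""/docs/original-index.htm"">Documents</a></td></tr>"

    Dim link As Variant
    For Each link In ExtractTenQLinks(html)
        Debug.Print link   ' prints /docs/original-index.htm
    Next link
End Sub
```

Running DemoExtractTenQLinks prints only the link from the 10-Q row. The hrefs in the page source may be relative, so they might need the site prefix added before they are used.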

VBA macro that scrapes HTML picks up some unwanted elements