Parsing certain websites in VBA using the XMLHTTP object

Asked: 2019-06-10 05:12:12

Tags: html excel vba web-scraping

I am trying to extract the "Key people" field from the Wikipedia page https://en.wikipedia.org/wiki/Abbott_Laboratories and copy its value into an Excel spreadsheet.

I managed to do this with XMLHTTP, which I prefer because it is fast. The code below works for that page.

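However, the code is not flexible, because the structure of a Wikipedia page can vary. For example, it does not work on https://en.wikipedia.org/wiki/3M, because the tr/td structure is not quite the same there ("Key people" is no longer the 8th TR on the 3M page).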

How can I improve the code?

Public Sub parsehtml()

    ' HTMLDocument is early-bound: requires a reference to "Microsoft HTML Object Library".
    Dim http As Object, html As New HTMLDocument, topics As Object, titleElem As Object, detailsElem As Object, topic As HTMLHtmlElement
    Dim i As Integer

    Set http = CreateObject("MSXML2.XMLHTTP")

    ' Synchronous GET request (third argument False)
    http.Open "GET", "https://en.wikipedia.org/wiki/Abbott_Laboratories", False
    http.send

    html.body.innerHTML = http.responseText

    ' Hard-coded row index: this is the part that breaks on other pages
    Set topic = html.getElementsByTagName("tr")(8)
    Set titleElem = topic.getElementsByTagName("td")(0)

    ThisWorkbook.Sheets(1).Cells(1, 1).Value = titleElem.innerText

End Sub

2 Answers:

Answer 0 (score: 2)

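If the table row holding "Key people" is not fixed, why not loop over the table rows and look for "Key people"?

I tested the following modification and found that it works.

In the declarations section: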

' Declares topic (the name used below) as a table; this replaces the earlier declaration of topic
Dim topic As HTMLTable, Rw As HTMLTableRow

Then, at the end:

html.body.innerHTML = http.responseText
Set topic = html.getElementsByClassName("infobox vcard")(0)

' Scan the infobox rows for the "Key people" label instead of relying on a fixed row index
For Each Rw In topic.Rows
    If Rw.Cells(0).innerText = "Key people" Then
        ThisWorkbook.Sheets(1).Cells(1, 1).Value = Rw.Cells(1).innerText
        Exit For
    End If
Next
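
For reference, here is a sketch of how the fragments above might fit together as one complete procedure. The procedure name `ParseKeyPeople`, the `Is Nothing` check, the cell-count guard and the `Trim$` call are my additions for illustration, not part of the tested code. It assumes a reference to "Microsoft HTML Object Library" and that the first "infobox vcard" element on the page is a table.

```vba
Option Explicit

' Sketch only: assumes a reference to "Microsoft HTML Object Library"
' (Tools > References) and that the infobox is an HTML table.
Public Sub ParseKeyPeople()
    Dim http As Object
    Dim html As New HTMLDocument
    Dim topic As HTMLTable
    Dim Rw As HTMLTableRow

    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", "https://en.wikipedia.org/wiki/3M", False
    http.send

    html.body.innerHTML = http.responseText

    ' First element carrying both the "infobox" and "vcard" classes
    Set topic = html.getElementsByClassName("infobox vcard")(0)
    If topic Is Nothing Then Exit Sub   ' added guard: no infobox found

    For Each Rw In topic.Rows
        ' Added guard: skip rows that do not have a label cell and a value cell
        If Rw.Cells.Length > 1 Then
            If Trim$(Rw.Cells(0).innerText) = "Key people" Then
                ThisWorkbook.Sheets(1).Cells(1, 1).Value = Rw.Cells(1).innerText
                Exit For
            End If
        End If
    Next
End Sub
```

If no row is labelled "Key people", the loop simply finishes and nothing is written to the sheet.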

Answer 1 (score: 1)

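There is a better, faster way, at least for the given URLs: match on the element's class name and index into the returned nodeList. Fewer items are returned, the path to the element is shorter, and matching on class name is faster than matching on element type.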

Option Explicit
Public Sub GetKeyPeople()
    Dim html As HTMLDocument, body As String, urls(), i As Long, keyPeople
    Set html = New HTMLDocument
    urls = Array("https://en.wikipedia.org/wiki/Abbott_Laboratories", "https://en.wikipedia.org/wiki/3M")
    With CreateObject("MSXML2.XMLHTTP")
        For i = LBound(urls) To UBound(urls)
            .Open "GET", urls(i), False
            .send
            html.body.innerHTML = .responseText
            keyPeople = html.querySelectorAll(".agent").item(1).innerText
            ThisWorkbook.Worksheets("Sheet1").Cells(i + 1, 1).Value = keyPeople
        Next
    End With
End Sub
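
The fixed index in `.item(1)` assumes every page has at least two elements with the class `agent`. As a hedged sketch, the lookup could be wrapped in a small helper that checks the match count first. `KeyPeopleFromDocument` is a hypothetical name, and returning an empty string when there are too few matches is my own choice.

```vba
' Hypothetical helper; requires a reference to "Microsoft HTML Object Library".
' Returns the text of the second ".agent" match, or an empty string
' when the page has fewer than two matches.
Public Function KeyPeopleFromDocument(ByVal html As HTMLDocument) As String
    Dim agents As Object
    Set agents = html.querySelectorAll(".agent")

    If agents.Length > 1 Then
        KeyPeopleFromDocument = agents.item(1).innerText
    Else
        KeyPeopleFromDocument = vbNullString
    End If
End Function
```

Inside the loop, the assignment would then read `keyPeople = KeyPeopleFromDocument(html)`, so a page without the expected elements leaves the cell empty instead of raising a runtime error.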