无法从网站抓取某些字段

时间:2017-04-24 11:47:48

标签: vba split web-scraping responsetext

我已经在vba中编写了一个脚本,我可以使用该脚本解析来自特定网站的“公司名称”,“电话”,“传真”和“电子邮件”,但是在抓取“地址”,“网络”和“名字“我卡住了。我在vba中使用了responsetext和split方法编写了脚本。希望有人能告诉我一个解决方法。

以下是我尝试的内容:

str = Split(http.responseText, " class=""contact-details block dark"">")
y = UBound(str)
    For i = 1 To y
        Cells(x, 1) = Split(Split(str(i), "Company Name:")(1), "<")(0)
        Cells(x, 2) = Split(Split(str(i), "Phone:")(1), "<")(0)
        Cells(x, 3) = Split(Split(str(i), "Fax:")(1), "<")(0)
        Cells(x, 4) = Split(Split(str(i), "mailto:")(1), ">")(0)
        x = x + 1
    Next i

这里是html元素的东西:

<div class="contact-details block dark">
                <h3>Contact Details</h3><p>Company Name: PPEHeads Australia<br>Phone: +61 2 9824 5520<br>Fax: +61 2 9824 5526<br>Web: <a target="_blank" href="http://www.ppeheads.com.au">http://www.ppeheads.com.au</a></p><h4>Address</h4><p>Unit 2 / 4 Reaghs Farm Road<br>MINTO<br>NSW<br>2566</p><h4>Contact</h4><p>Name: Alan Hadfield<br>Phone: +61 2 9824 5520<br>Fax: +61 2 9824 5526<br>Email: <a href="mailto:alan@ppeheads.com.au">alan@ppeheads.com.au</a></p>
            </div>

1 个答案:

答案 0 :(得分:2)

请在下次提供其余代码,因为问题可能不在您认为的位置。幸运的是,我找到了您之前的帖子here

如果你仔细看看你的html元素中有3个p标签:

第一个是联系公司详细信息,您可以通过

获取
Set ele = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(0)

第二个是地址详情,您可以通过

获取
Set ele2 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(1)

第三个是联系人详细信息,您可以通过

获取
Set ele3 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(2)

注意(0),(1),(2)代码末尾的变化,它给出了p标签的出现顺序。

我修改了您之前的代码并对更改进行了评论,以便您可以看到差异:

Sub RestData()
Dim http As New MSXML2.XMLHTTP60
Dim html As New HTMLDocument
Dim ele, ele2, ele3 As Object, post As Object 'declare
Dim TypeDetails() As String
Dim TypeDetails3() As String 'declare
Dim TypeDetail() As String
Dim i As Long, r As Long
With CreateObject("MSXML2.serverXMLHTTP")
    .Open "GET", "http://www.austrade.gov.au/SupplierDetails.aspx?ORGID=ORG0120000508&folderid=1736", False
    .send
    html.body.innerHTML = .responseText
End With

'get all the p elements
Set ele = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(0)
Set ele2 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(1)
Set ele3 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(2)
r = 2

'split from line feed
TypeDetails() = Split(ele.innerText, Chr(10))
TypeDetails3() = Split(ele3.innerText, Chr(10))

'This part goes for Contact Company Details, notice the operator is ": ",not ":"
For i = 0 To UBound(TypeDetails())
    TypeDetail() = Split(TypeDetails(i), ": ")
    Cells(r, 1) = VBA.Trim(TypeDetail(0))
    Cells(r, 2) = VBA.Trim(TypeDetail(1))
    r = r + 1
Next i

'This part goes for Address Details, replaced new line with " " for it to be in the same line
Cells(r, 1) = "Address"
Cells(r, 2) = Replace(ele2.innerText, vbLf, " ")
r = r + 1

'This part goes for Contact Person Details
For i = 0 To UBound(TypeDetails3())
    TypeDetail() = Split(TypeDetails3(i), ": ")
    Cells(r, 1) = VBA.Trim(TypeDetail(0))
    Cells(r, 2) = VBA.Trim(TypeDetail(1))
    r = r + 1
Next i

Set html = Nothing: Set ele = Nothing: Set docs = Nothing
End Sub

我希望这会有所帮助