使用Excel VBA进行Web Scraping - 需要帮助

时间:2015-12-06 14:37:39

标签: html excel excel-vba web-scraping vba

作为一个相对较新的VBA编码器,我被赋予了从旅行预订网站(相当具有挑战性的第一项任务)抓取相关票务信息的网络任务。我将一些代码整理在一起,并且能够实际打开目标URL并搜索已定义的参数。我现在卡在实际解析必要信息的部分。

解决

1 个答案:

答案 0 :(得分:0)

您必须在发现元素时隔离元素,然后只依次处理每个匹配元素中的项目。有很多方法可以做到这一点。对我而言,我发现处理getElementsByClassName methodgetElementsByTagName method返回的集合项目最好使用With ... End With statements来处理。这会隔离项目,以便后续内部项目调用仅发现父元素中的匹配项。

With oie
    .navigate "http://english.ctrip.com/trains/List/Index?DepartureCity=shanghai%28%E4%B8%8A%E6%B5%B7%29&ArrivalCity=beijing%28%E5%8C%97%E4%BA%AC%29&DepartDate=12-7-2015&TrainNo=&DepartureCityPinyin=&ArrivalCityPinyin=&DepartureStation=%E4%B8%8A%E6%B5%B7&ArrivalStation=%E5%8C%97%E4%BA%AC&searchboxArg="
    .Visible = True

    Do Until (.readyState = 4 And Not .Busy)
       DoEvents
    Loop

    With oie.document
        For ts = 0 To .getelementsbyclassname("train-seat").Length - 1
            'work with each UL of class 'train-seat'
            With .getelementsbyclassname("train-seat")(ts)
                'work with each LI within the UL
                For li = 0 To .getelementsbytagname("li").Length - 1
                    With .getelementsbytagname("li")(li)
                        'only keep working if there is at least one class-type class and one anchor element
                        If CBool(.getelementsbyclassname("class-type").Length) And _
                           CBool(.getelementsbyclassname("price-num").Length) And _
                           CBool(.getelementsbytagname("a").Length) Then
                            'show first 'class-type' class within the LI
                            Debug.Print .getelementsbyclassname("class-type")(0).innertext
                            'show first 'price-num' class within the LI
                            Debug.Print .getelementsbyclassname("price-num")(0).innertext
                            'the other one are trickier; get the first A element
                            With .getelementsbytagname("a")(0)  'c-btn btn-key class
                                'but can be parsed using Split of the outerHTML on the quote character
                                vPIECEs = Split(.outerhtml, Chr(34))
                                Debug.Print vPIECEs(13) '& vPIECEs(14) '& vPIECEs(7)
                            End With
                        End If
                    End With
                Next li
            End With
        Next ts

    End With
End With

Set oie = Nothing

我已选择使用Debug.Print将输出发送到VBE的立即窗口(Ctrl + G),而不是重复显示MsgBox弹出窗口。