我如何使用excel vba将网站上的特定数据(名称,详细信息)提取到excel中?

时间:2018-08-07 02:54:34

标签: html excel vba excel-vba web-scraping

我如何使用excel vba将网站上的特定数据(名称,详细信息)提取到excel中?

下面我试图获得处理器和保修:

Option Explicit
Sub GetData()
    Dim objIE As InternetExplorer
    Dim itemELE As Object
    Dim html As IHTMLDocument
    Dim Processor As String
    Dim warranty As String
    Dim y As Integer

    'start a new browser instance
    Set objIE = New InternetExplorer
    'make browser visible
    objIE.Visible = True

    'navigate to page with needed data
    objIE.navigate "https://www.harveynorman.com.sg/computers-tablets-and-gaming/computers/laptops/"
    'wait for page to load
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop

    y = 1

    For Each itemELE In objIE.document.getElementsByClassName("expandabaleContent")
        Processor = itemELE.getElementsByTagName("d1")(0).innerText
        warranty = itemELE.getElementsByClassName("d1")(0).getElementsByTagName("a")(0).textContent

        Sheets("Sheet1").Range("A" & y).Value = Processor
        Sheets("Sheet1").Range("B" & y).Value = warranty
        y = y + 1
    Next
End Sub

页面截图:

Webpage

2 个答案:

答案 0 :(得分:4)

对于显示的页面(图像中的 ),您可以发出XMLHTTP (XHR) GET request来获取产品信息,而无需打开缓慢的IE浏览器实例。


有关具体信息:

处理器和保修信息:

如果您查看该页面,则有关处理器和保修的信息将与类名facetedResults-feature-list

关联

class

您可以看到类名,然后看到一个dl标记,其中包含一个dt标记,该标记具有同级dd标记。这些同级dd标签中有两个与处理器和保修信息相关联。

在这种情况下,我使用CSS选择器来捕获所有这些dd标签,这些标签可以简化,以忽略同级dt和父dl标签,仅使用:

.facetedResults-feature-list dd

"."class selector。上面的CSS组合选择说,在类dd的元素中获取facetedResults-feature-list标签


产品标题信息:

titles我使用了另一个CSS选择器:

.facetedResults-title

这是类facetedResults-title的元素。其中包含产品标题。

title


在表格中写出产品标题,处理器和保修信息:

一些数学运算显示出,处理器信息每14重复一次,如果我在处理器的索引中加8,则会得到保修信息。您会看到如何写出每一个细节,这些细节发生在每隔14个重复的索引处。我将dd元素的nodeList与titles的循环组合在一起写出工作表。


VBA:

Option Explicit
Public Sub GetInfo()
    Dim sResponse As String, i As Long, html As New HTMLDocument
    Application.ScreenUpdating = False
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.lenovo.com/sg/en/laptops/c/LAPTOPS", False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
    sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
    Dim titles As Object, targetedInfo As Object, rowCounter As Long
    With html
        .body.innerHTML = sResponse
        Set titles = .querySelectorAll(".facetedResults-title")
        Set targetedInfo = .querySelectorAll(".facetedResults-feature-list dd")
    End With
    With Worksheets("Sheet1")
        For i = 0 To targetedInfo.Length - 1
            If i Mod 14 = 0 Then
                rowCounter = rowCounter + 1
                .Cells(rowCounter, 1) = titles(rowCounter - 1).innerText
                .Cells(rowCounter, 2) = targetedInfo(i).innerText
                .Cells(rowCounter, 3) = targetedInfo(i + 8).innerText
            End If
        Next i          
    End With
    Application.ScreenUpdating = True
End Sub

输出示例:

output sample


更多常规信息:

CSS选择器:

产品信息与'expandableContent facetedResults-expandableContent-features expandableContent-is-collapsed facetedResults-expandableContent-69'类名

相关联

价格与'expandableContent facetedResults-expandableContent-price expandableContent-is-collapsed'类名称相关。

您可以通过传统的.getElementsByClassName选择它们,然后遍历集合,或者,就我而言,使用CSS选择器为类做同样的事情,然后遍历返回的{{ 1}}。

nodeList

相同
.getElementsByClassName("expandableContent facetedResults-expandableContent-features expandableContent-is-collapsed facetedResults-expandableContent-69") 

.querySelectorAll(".expandableContent.facetedResults-expandableContent-features.expandableContent-is-collapsed.facetedResults-expandableContent-69") class selector

标题与类别"."

相关联

VBA:

facetedResults-title

所需参考(VBE>工具>参考):

  1. Microsoft HTML对象库

答案 1 :(得分:2)

Qharr已经提供了一些不错的选择,但是如果您仍然想尝试IE,请参见下面的代码

Option Explicit
Sub GetData()
    Dim objIE As InternetExplorer
    Dim itemELE As Object
    Dim html As IHTMLDocument
    Dim Processor As String
    Dim warranty As String
    Dim y As Integer

    'start a new browser instance
    Set objIE = New InternetExplorer
    'make browser visible
    objIE.Visible = True

    'navigate to page with needed data
    objIE.navigate "https://www.lenovo.com/sg/en/laptops/c/LAPTOPS"
    'wait for page to load

    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
    Application.Wait Now + TimeSerial(0, 0, 3)


    y = 1


    For Each itemELE In objIE.document.getElementsByClassName("facetedResults-feature-list")

        If InStr(1, itemELE.className, "bundleList", vbTextCompare) = 0 Then
            Application.Wait Now + TimeSerial(0, 0, 2)
            Processor = itemELE.getElementsByTagName("dl")(0).innerText
            warranty = itemELE.getElementsByTagName("dl")(4).innerText

            Sheets("Sheet1").Range("A" & y).Value = Processor
            Sheets("Sheet1").Range("B" & y).Value = warranty
            y = y + 1
        End If
    Next
End Sub

结果

enter image description here