Web scraping product prices and specs

Time: 2020-05-28 20:30:23

Tags: excel vba web-scraping

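I'm new to the VBA world, and I'm trying to build an Excel file that fetches the cheapest price and the name tag of products on a website. I've already written one macro that walks the whole search listing, collects each product URL, and puts it in the spreadsheet. My current challenge is getting the second macro to work. It starts fetching information without any problem, but after 10 or more URLs the macro throws an error and starts repeating the same information in all the following cells. Is there a way to make the code run more slowly so that it doesn't run into this problem?

I'll list my code here, along with examples of the URLs I'm scraping.

This is the code I've been using: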

Sub test()
Dim URL As String

Set ie = CreateObject("internetexplorer.application")

For i = 2 To 300

URL = Cells(i, 1).Value

ie.navigate URL
ie.Visible = False

Do While ie.busy And ie.readystate <> "readystate_complete"
DoEvents
Loop

Cells(i, 3) = ie.document.getElementsByTagName("h1")(0).innerText
Cells(i, 4) = ie.document.getElementsByTagName("strong")(0).innerText
Next i

ie.Quit

MsgBox "acabou"

End Sub

<https://www.zoom.com.br/notebook/notebook-vaio-vjf157f11x-b0211s-intel-core-i5-8250u-15-6-8gb-ssd-256-gb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-samsung-chromebook-plus-intel-celeron-3965y-12-2-4gb-emmc-32-gb-chrome-os-touchscreen>
<https://www.zoom.com.br/notebook/notebook-dell-xps-7390-intel-core-i7-10710u-13-3-16gb-ssd-512-gb-windows-10-touchscreen>
<https://www.zoom.com.br/notebook/notebook-dell-i15-3583-a5-intel-core-i7-8565u-15-6-8gb-hd-2-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i5-8250u-15-6-4gb-hd-1-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-dell-i15-7580-a20-intel-core-i7-8550u-15-6-8gb-hd-1-tb-geforce-mx150-windows-10>
<https://www.zoom.com.br/notebook/notebook-dell-i14-3480-u30-intel-core-i5-8265u-14-4gb-hd-1-tb-linux-8-geracao>
<https://www.zoom.com.br/notebook/macbook-pro-apple-muhn2bz-intel-core-i5-13-3-8gb-ssd-128-gb-tela-de-retina>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc150-amd-a4-9120-14-2gb-emmc-32-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-samsung-np930qaa-kw1br-intel-core-i7-8550u-13-3-8gb-ssd-256-gb-windows-10-touchscreen>
<https://www.zoom.com.br/notebook/notebook-acer-a515-51g-58vh-intel-core-i5-7200u-15-6-8gb-hd-1-tb-geforce-940mx>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc222-intel-celeron-dual-core-13-3-4gb-emmc-64-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-pt515-51-788a-intel-core-i7-9750h-15-6-32gb-ssd-1-tb-geforce-rtx-2070-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-a315-53-53ak-intel-core-i5-7200u-15-6-4gb-hd-1-tb-windows-10-7-geracao>
<https://www.zoom.com.br/notebook/notebook-dell-i15-5584-m40-intel-core-i7-8565u-15-6-8gb-hd-2-tb-geforce-mx130-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-a315-41g-r21b-amd-ryzen-5-2500u-15-6-8gb-hd-1-tb-radeon-535-windows-10>
<https://www.zoom.com.br/notebook/notebook-positivo-master-n2140-intel-core-i3-7020u-14-4gb-hd-500-gb-windows-10-7-geracao>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc101-intel-atom-14-1gb-ssd-32-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i5-8250u-15-6-8gb-hd-1-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-acer-an515-51-77fh-intel-core-i7-7700hq-15-6-8gb-hd-1-tb-geforce-gtx-1050-windows-10>
<https://www.zoom.com.br/notebook/notebook-dell-i15-3583-a2yp-intel-core-i5-8265u-15-6-4gb-optane-16-gb-hd-1-tb-windows-10>
<https://www.zoom.com.br/notebook/notebook-asus-g531gt-intel-core-i7-9750h-15-6-16gb-ssd-512-gb-geforce-gtx-1650-windows-10>
<https://www.zoom.com.br/notebook/notebook-vaio-fit-15s-intel-core-i3-7100u-15-6-4gb-hd-1-tb-windows-10-home>
<https://www.zoom.com.br/notebook/notebook-samsung-s50-intel-core-i7-7500u-13-3-8gb-ssd-256-gb-windows-10-style>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i3-7020u-15-6-4gb-ssd-120-gb-windows-10-7-geracao>

3 Answers:

Answer 0 (score: 2)

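First:
Always declare all variables. To enforce this, always write Option Explicit as the first line of every module. That way the compiler immediately detects typos in variable names.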

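Now to your problem:
IE can be quite temperamental. For example, it does not like having to process URLs in quick succession in the same instance. It is therefore advisable to kick it out of memory and start a fresh instance for each new URL.

For it to restart quickly, its settings must not be configured to delete the history, cache, etc. on exit. Otherwise automation errors occur.

Try this macro. It works with the URLs you gave: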

Option Explicit

Sub test()

Dim URL As String
Dim ie As Object
Dim i As Long

  For i = 2 To 300

    If i > 14 Then
      ActiveWindow.SmallScroll down:=1
    End If

    URL = ActiveSheet.Cells(i, 1).Value

    Set ie = CreateObject("internetexplorer.application")
    ie.navigate URL
    ie.Visible = False
    Do While ie.readystate <> 4: DoEvents: Loop

    ActiveSheet.Cells(i, 3) = ie.document.getElementsByTagName("h1")(0).innerText
    ActiveSheet.Cells(i, 4) = ie.document.getElementsByTagName("strong")(0).innerText

    ie.Quit
    Set ie = Nothing
  Next i

  MsgBox "acabou"
End Sub

Answer 1 (score: 1)

Try this:

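' Requires references (Tools > References):
'   Microsoft HTML Object Library and Microsoft XML, v6.0
Sub GetPrices()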
    Dim html As MSHTML.HTMLDocument, r As Long
    For r = 1 To 4
        Set html = GetHTML(Cells(r, 1).Value)
        Cells(r, 3).Value = html.querySelector("h1.product-name").innerText
        Cells(r, 4).Value = Replace(Replace(html.querySelector(".product-price").innerText, "a partir de  ", vbNullString), ":( ", "")
        Set html = Nothing
    Next r
End Sub

Function GetHTML(ByVal sURL As String) As HTMLDocument
    Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
    Set http = New MSXML2.XMLHTTP60
    Set html = New MSHTML.HTMLDocument
    With http
        .Open "Get", sURL, False
        .send
        html.body.innerHTML = .responseText
    End With
    Set GetHTML = html
End Function

Answer 2 (score: 0)

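In Python, I would do the following:

  • First, import the libraries:

from time import sleep
from random import randint

There are two options:

  • This call makes the script sleep for 3 seconds on each iteration: sleep(3)

  • This call makes the script sleep for a random 1 to 3 seconds on each iteration: sleep(randint(1, 3))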
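As a rough sketch of how such a pause could sit inside a scraping loop: the helper below waits a random interval before every request except the first. The names `fetch_all`, `fetch`, `min_delay`, and `max_delay` are hypothetical and not part of the original answer. The actual download function is passed in by the caller, so nothing here is specific to one HTTP library.

```python
import random
import time
from typing import Callable, Iterable, List


def fetch_all(urls: Iterable[str],
              fetch: Callable[[str], str],
              min_delay: float = 1.0,
              max_delay: float = 3.0,
              sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Call fetch(url) for each URL, pausing a random interval between calls.

    `fetch` is a caller-supplied function (hypothetical here) that downloads
    one page and returns its text. `sleep` is injectable so the loop can be
    exercised without really waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

Passing the `sleep` function as a parameter is only a convenience for trying the loop out quickly. In a real run the default `time.sleep` is used.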

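Note: pay attention to:

  • Cookies, because sometimes you need to send a POST request to grab some IDs before the GET.

  • The syntax of the headers, and setting the start and end parameters correctly.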
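To illustrate the point about cookies and headers, here is a minimal sketch using only the Python standard library. The helper names `make_opener` and `make_request`, and the example.com URLs in the usage comment, are placeholders invented for illustration. Which headers and form fields a given site actually expects is something you would have to find out by inspecting its traffic.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar
from typing import Dict, Optional


def make_opener() -> urllib.request.OpenerDirector:
    """Return an opener that remembers cookies across requests."""
    # The cookie jar stores cookies set by earlier responses (for example
    # a preliminary POST) and sends them back on later requests.
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))


def make_request(url: str,
                 headers: Dict[str, str],
                 form: Optional[Dict[str, str]] = None) -> urllib.request.Request:
    """Build a GET request, or a POST request when form data is given."""
    data = urllib.parse.urlencode(form).encode("utf-8") if form else None
    return urllib.request.Request(url, data=data, headers=headers)


# Usage (performs network I/O, so it is left commented out; the URLs and
# header values are placeholders):
# headers = {"User-Agent": "Mozilla/5.0"}
# opener = make_opener()
# opener.open(make_request("https://example.com/session", headers, {"q": "x"}))
# html = opener.open(make_request("https://example.com/page", headers)).read()
```

Sending every request through the same opener is what lets the cookies from the first response reach the later GET.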