Scraping a website with an XML HTTP request after setting a cookie (Excel VBA)

Time: 2018-09-06 15:56:22

Tags: excel vba cookies xmlhttprequest screen-scraping

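I want to scrape a website (extract product prices) from individual product pages with an XML HTTP request. Before running the script, though, I first need to select the correct store (saved in a browser cookie, or included in the request in any other way, if possible), because prices differ between stores.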

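I have working code, but it takes a long time to run, so I think there must be a faster, cleaner way :). I also had to include Application.Wait calls so that the website can keep up with the steps.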

My current VBA code:

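  • It runs an Internet Explorer (IE) request to open the website, then clicks a few times to select the desired store and save it in a cookie (just as a site user would).
  • The product page is then requested with another IE request and the data is extracted. I found that I can't use an XML HTTP request here, because it doesn't use the cookie value for the correct store and so doesn't show the correct price.
  • The price I want (in the example below) is €1.39, not €1.48 (which is shown when the cookie value isn't used and no store is selected).
  • The cookie value is saved in "www.jumbo.com/cookie/HomeStore", which holds the store ID; the store ID is known in advance and could be hard-coded in the request.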
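
Before hard-coding the cookie, it may help to see what the browser actually stores. The following is a minimal sketch, assuming the HomeStore cookie is readable from script (an HttpOnly cookie would not show up here); the Sub name PrintJumboCookies is made up for illustration.

```vba
'Requires references: Microsoft Internet Controls, Microsoft HTML Object Library
Sub PrintJumboCookies()

    Dim IE As New SHDocVw.InternetExplorer
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim Cookies() As String
    Dim i As Long

    IE.Visible = False
    IE.navigate "https://www.jumbo.com/content/algemene-voorwaarden/"

    'wait until the page has loaded; DoEvents keeps Excel responsive
    Do While IE.Busy Or IE.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop

    Set HTMLDoc = IE.document

    'document.cookie returns the script-visible cookies as "name1=value1; name2=value2"
    Cookies = Split(HTMLDoc.cookie, "; ")
    For i = LBound(Cookies) To UBound(Cookies)
        Debug.Print Cookies(i)
    Next i

    IE.Quit

End Sub
```

Running this after the store has been selected in IE should list a HomeStore entry in the Immediate window if the cookie is script-visible; the exact name=value pair printed there is what a hard-coded request would need to reproduce.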

Selecting the correct store (and saving it in a browser cookie)

Sub SetStore()

    'Requires references: Microsoft Internet Controls, Microsoft HTML Object Library
    Dim IE As New SHDocVw.InternetExplorer
    Dim HTMLDoc As MSHTML.HTMLDocument

    Dim HTMLSearchbox As MSHTML.IHTMLElement
    Dim HTMLSearchboxes As MSHTML.IHTMLElementCollection
    Dim HTMLButton As MSHTML.IHTMLElement
    Dim HTMLButtons As MSHTML.IHTMLElementCollection
    Dim HTMLSearchButton As MSHTML.IHTMLElement
    Dim HTMLSearchButtons As MSHTML.IHTMLElementCollection
    Dim HTMLStoreID As MSHTML.IHTMLElement
    Dim HTMLStoreIDs As MSHTML.IHTMLElementCollection
    Dim HTMLSaveStore As MSHTML.IHTMLElement
    Dim HTMLSaveStores As MSHTML.IHTMLElementCollection

    'set to False to hide the IE window
    IE.Visible = True

    'navigate to a URL with limited content
    IE.navigate "https://www.jumbo.com/content/algemene-voorwaarden/"

    'wait until the page has loaded; DoEvents keeps Excel responsive
    Do While IE.Busy Or IE.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop
    Set HTMLDoc = IE.document

    'open the store finder
    Set HTMLButtons = HTMLDoc.getElementsByTagName("button")
    For Each HTMLButton In HTMLButtons
        If HTMLButton.getAttribute("data-jum-action") = "openHomeStoreFinder" Then
            HTMLButton.Click
            Exit For
        End If
    Next HTMLButton

    Application.Wait Now + TimeValue("00:00:02")

    'fill in the store name/location to show search results
    Set HTMLSearchboxes = HTMLDoc.getElementsByTagName("input")
    For Each HTMLSearchbox In HTMLSearchboxes
        If HTMLSearchbox.getAttribute("id") = "searchTerm__DkKYx4XylsAAAFJktpb2Guy" Then
            HTMLSearchbox.Value = "Oosterhout"
            Application.Wait Now + TimeValue("00:00:03")
            HTMLSearchbox.Click
            Exit For
        End If
    Next HTMLSearchbox

    'start the search
    Set HTMLSearchButtons = HTMLDoc.getElementsByTagName("button")
    For Each HTMLSearchButton In HTMLSearchButtons
        If HTMLSearchButton.getAttribute("data-jum-filter") = "search" Then
            HTMLSearchButton.Click
            Exit For
        End If
    Next HTMLSearchButton

    Application.Wait Now + TimeValue("00:00:05")

    'pick the store from the search results
    'oosterhout = YC8KYx4XB88AAAFIDcIYwKxJ
    'nieuwegein = 84IKYx4XziUAAAFInSYYwKrH
    'vaassen = JYYKYx4XC1oAAAFItvcYwKxJ
    'brielle = OG8KYx4XP4wAAAFIlsEYwKxK
    Set HTMLStoreIDs = HTMLDoc.getElementsByTagName("li")
    For Each HTMLStoreID In HTMLStoreIDs
        If HTMLStoreID.getAttribute("data-jum-store-id") = "YC8KYx4XB88AAAFIDcIYwKxJ" Then
            HTMLStoreID.Click
            Application.Wait Now + TimeValue("00:00:03")
            Exit For
        End If
    Next HTMLStoreID

    'save the selected store (this sets the cookie)
    Set HTMLSaveStores = HTMLDoc.getElementsByTagName("button")
    For Each HTMLSaveStore In HTMLSaveStores
        If HTMLSaveStore.getAttribute("data-jum-action") = "saveHomeStore" Then
            HTMLSaveStore.Click
            Exit For
        End If
    Next HTMLSaveStore

    'IE.Quit

End Sub
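
Most of the run time in SetStore comes from the fixed Application.Wait pauses. One possible alternative is to poll for the element that the next step needs and continue as soon as it appears. The sketch below assumes that approach; WaitForElement is a hypothetical helper, not something from the site or from any library.

```vba
'Requires reference: Microsoft HTML Object Library
'Hypothetical helper: polls the document until an element with the given tag
'name and attribute value exists, or until the timeout elapses.
'Returns Nothing when the element never appears.
Function WaitForElement(HTMLDoc As MSHTML.HTMLDocument, _
                        ByVal TagName As String, _
                        ByVal AttrName As String, _
                        ByVal AttrValue As String, _
                        Optional ByVal TimeoutSeconds As Double = 10) As MSHTML.IHTMLElement

    Dim Elem As MSHTML.IHTMLElement
    Dim Deadline As Date

    'a Date is a day count, so seconds are converted to a fraction of a day
    Deadline = Now + TimeoutSeconds / 86400#

    Do
        For Each Elem In HTMLDoc.getElementsByTagName(TagName)
            If Elem.getAttribute(AttrName) = AttrValue Then
                Set WaitForElement = Elem
                Exit Function
            End If
        Next Elem
        DoEvents    'let IE and Excel process pending work between polls
    Loop While Now < Deadline

End Function

'Example use inside SetStore, replacing a fixed wait plus a For Each loop:
'    Set HTMLSaveStore = WaitForElement(HTMLDoc, "button", "data-jum-action", "saveHomeStore")
'    If Not HTMLSaveStore Is Nothing Then HTMLSaveStore.Click
```

The trade-off is that each step then takes only as long as the page needs, at the cost of having to handle the Nothing case when an element never shows up.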

Extracting data from the product page (IE request, uses the store value from the cookie)

Sub GetJumboPriceIE()

    Dim IE As New SHDocVw.InternetExplorer
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim JumPrice As MSHTML.IHTMLElement
    Dim JumboPrice As Double
    Dim Price_In_Cents_Tag As String
    Dim SKU_tag As String, SKU_url As String

    SKU_tag = "173140KST"
    SKU_url = "https://www.jumbo.com/lu-bastogne-koeken-original-260g/173140KST/"

    IE.Visible = False
    IE.navigate SKU_url

    'wait until the page has loaded; DoEvents keeps Excel responsive
    Do While IE.Busy Or IE.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop

    Set HTMLDoc = IE.document

    'the price is stored in cents in the element with id "PriceInCents_<SKU>"
    Price_In_Cents_Tag = "PriceInCents_" & SKU_tag
    Set JumPrice = HTMLDoc.getElementById(Price_In_Cents_Tag)

    If JumPrice Is Nothing Then
        MsgBox "Price element " & Price_In_Cents_Tag & " not found"
    Else
        JumboPrice = JumPrice.getAttribute("value") / 100
        Debug.Print JumboPrice
    End If

    'quit IE only after the document has been read
    IE.Quit

End Sub

The code above works, but I would prefer to use the XML HTTP request code below (with the correct store, though). The price is 1,39.

Extracting data from the product page (using an XML HTTP request), but without the cookie value

Sub GetJumboPriceXML()

    'Requires references: Microsoft XML, v6.0 and Microsoft HTML Object Library
    Dim XMLReq As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    Dim JumPrice As MSHTML.IHTMLElement
    Dim JumboPrice As Double
    Dim Price_In_Cents_Tag As String
    Dim SKU_tag As String, SKU_url As String

    SKU_tag = "173140KST"
    SKU_url = "https://www.jumbo.com/lu-bastogne-koeken-original-260g/173140KST/"

    'synchronous GET request; no store cookie is sent here
    XMLReq.Open "GET", SKU_url, False
    XMLReq.send

    If XMLReq.Status <> 200 Then
        MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
        Exit Sub
    End If

    'load the response into an HTML document so it can be queried
    HTMLDoc.body.innerHTML = XMLReq.responseText

    'the price is stored in cents in the element with id "PriceInCents_<SKU>"
    Price_In_Cents_Tag = "PriceInCents_" & SKU_tag
    Set JumPrice = HTMLDoc.getElementById(Price_In_Cents_Tag)

    If JumPrice Is Nothing Then
        MsgBox "Price element " & Price_In_Cents_Tag & " not found"
        Exit Sub
    End If

    JumboPrice = JumPrice.getAttribute("value") / 100
    Debug.Print JumboPrice

End Sub

This code doesn't use the correct store and outputs a price I don't want (it prints 1,48).


Summary:

When no store is selected (no cookie set), the following URL currently shows a price of €1,48.

I want the VBA script to set the store to "Jumbo Oosterhout Nieuwe Bouwlingstraat", then scrape a predefined list of product URLs and extract the prices (€1,39 for the URL above).

It should then set the store to another local store, "Jumbo Brielle Thoelaverweg", and scrape the same list of product URLs (€1,48 for the URL above).
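
The workflow described here could be sketched as follows. This is untested against the site and rests on two assumptions: that the server picks the store from a cookie named HomeStore whose value is the bare store ID, and that no other cookie or session state is required. It uses MSXML2.ServerXMLHTTP60 (from the same Microsoft XML, v6.0 reference), which lets the Cookie header be set explicitly. The names GetJumboPriceForStore and ScrapeStores are made up for illustration.

```vba
'Requires references: Microsoft XML, v6.0 and Microsoft HTML Object Library
Function GetJumboPriceForStore(ByVal SKU_url As String, _
                               ByVal SKU_tag As String, _
                               ByVal StoreId As String) As Variant

    Dim Req As New MSXML2.ServerXMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    Dim JumPrice As MSHTML.IHTMLElement

    Req.Open "GET", SKU_url, False
    'ASSUMPTION: cookie name "HomeStore" and a bare store ID as its value
    Req.setRequestHeader "Cookie", "HomeStore=" & StoreId
    Req.send

    If Req.Status <> 200 Then
        GetJumboPriceForStore = CVErr(xlErrNA)
        Exit Function
    End If

    HTMLDoc.body.innerHTML = Req.responseText
    Set JumPrice = HTMLDoc.getElementById("PriceInCents_" & SKU_tag)

    If JumPrice Is Nothing Then
        GetJumboPriceForStore = CVErr(xlErrNA)
    Else
        'the value is in cents, so divide by 100
        GetJumboPriceForStore = JumPrice.getAttribute("value") / 100
    End If

End Function

Sub ScrapeStores()

    Dim StoreIds As Variant, Urls As Variant
    Dim Parts() As String, SKU_tag As String
    Dim s As Long, u As Long

    'store IDs taken from the comments in SetStore: Oosterhout, Brielle
    StoreIds = Array("YC8KYx4XB88AAAFIDcIYwKxJ", "OG8KYx4XP4wAAAFIlsEYwKxK")
    Urls = Array("https://www.jumbo.com/lu-bastogne-koeken-original-260g/173140KST/")

    For s = LBound(StoreIds) To UBound(StoreIds)
        For u = LBound(Urls) To UBound(Urls)
            'the SKU tag is the last path segment before the trailing slash
            Parts = Split(Urls(u), "/")
            SKU_tag = Parts(UBound(Parts) - 1)
            Debug.Print StoreIds(s), SKU_tag, _
                        GetJumboPriceForStore(Urls(u), SKU_tag, StoreIds(s))
        Next u
    Next s

End Sub
```

If both stores still return the same price, the cookie probably has a different name or format than assumed, or the site needs additional cookies; comparing against the cookies that IE holds after SetStore has run would be the next thing to check.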

You can select a different store by clicking the location pin icon at the top right of the page.

Thanks a lot for your help.

0 answers:

No answers