Not all data is imported from website to Excel

Date: 2019-05-31 11:52:58

Tags: html excel vba web-scraping web-site-project

I want to import restaurant data (restaurant name, phone number, and website) into Excel. Unfortunately, I only get the first page. I want data from any page range I define, such as pages 1 to 3 or pages 2 to 5, with a separate sheet for each page. A sample of the output I currently get is attached.

    Sub Webscraping()
        'Declaration
        Dim ie As InternetExplorer
        Dim ht As HTMLDocument

        'Initialization
        Set ie = New InternetExplorer
        ie.Visible = True

        'Open a url
        ie.navigate ("https://www.yellowpages.com/atlanta-ga/restaurants")

        'Set ht = ie.document
        'MsgBox ht.getElementsByClassName("ot_lrp_bname_free_center")

        'Alternative Approach for wait
        Do Until ie.readyState = READYSTATE_COMPLETE
            DoEvents
        Loop

        'Initialize the document
        Set ht = ie.document

        'Set elems = ht.getElementsByClassName("list-title")
        Set elems = ht.getElementsByClassName("business-name")
        'Set elems = ht.getElementsByClassName("website-lrp icon-link ot_lrp_website_text_free_center")

        i = 1
        For Each elem In elems
            Sheet1.Cells(i, 1).Value = elem.innerText
            i = i + 1
            'Debug.Print (elem.innerText)
        Next

        Set elems = ht.getElementsByClassName("phone primary")

        i = 1
        For Each elem In elems
            Sheet1.Cells(i, 2).Value = elem.innerText
            i = i + 1
            'Debug.Print (elem.innerText)
        Next

        Set elems = ht.getElementsByClassName("links")

        i = 1
        For Each elem In elems
            Set link = elem.ChildNodes.Item(0)
            Sheet1.Cells(i, 3).Value = link.href
            i = i + 1
        Next

        'Set internetdata = ie.document
        'Set div_result = internetdata.getElementById("ctl00_gvMain_ctl03_hlTitle")
        'Set header_links = div_result.getElementsByTagName("a")
        'For Each h In header_links
        'Set link = h.ChildNodes.Item(0)
        'Worksheets("Stocks").Cells(Range("L" & Rows.Count).End(xlUp).Row + 1, 12) = link.href
    End Sub

This is the work I have done so far, but I am struggling to get the required output.

2 answers:

Answer 0 (score: 2)

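The page number is appended to the end of the URL. I would issue XHR requests in a loop over the given page range and use a regex to pull out the JSON that contains the required information (it sits in one of the script tags). This approach is very fast, which more than offsets the use of a regex. I also reuse objects where possible.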

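I use jsonconverter.bas to handle the JSON and parse out the required information (the JSON holds a lot more, including reviews). After downloading the .bas file and adding it to your project as a module named JsonConverter, go to VBE > Tools > References and add a reference to Microsoft Scripting Runtime.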
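
For orientation, here is a minimal sketch of what ParseJson hands back, assuming the JsonConverter module is already in the project as described above. The JSON string and the sub name DemoParseJson are made up for illustration. A JSON array comes back as a Collection and each JSON object as a Dictionary:

```vba
Option Explicit

Public Sub DemoParseJson()
    Dim json As Object, item As Object
    'Made-up sample; the real page JSON carries many more fields
    Set json = JsonConverter.ParseJson("[{""name"":""Sample Diner"",""telephone"":""(555) 010-0000""}]")
    For Each item In json              'json is a Collection of Dictionaries
        Debug.Print item("name"), item("telephone")
    Next
End Sub
```

This is why the write-out code can loop with For Each and read fields by key.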

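Helper functions test whether the sheet to write to already exists or needs to be created, and write the JSON results into an array that is then dumped to the sheet in one go (for efficiency). The structure is kept so that the retrieved information can easily be extended if more is wanted, for example reviews.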

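There may be some work left to do to make sure this behaves properly for pages that do not exist. At present I only use the status code of the response to filter these out.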


Notes:

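As a sanity check, I would use InternetExplorer to go to page 1 and extract the total results count. Dividing that by the results per page (currently 30) gives the total number of pages, and therefore the lbound and ubound values (the minimum and maximum possible pages). Then switch to xmlhttp for the actual retrieval. See the additional helper function at the end.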
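
A minimal sketch of that arithmetic, assuming the total count has already been read from page 1. PagesNeeded and ClampPageRange are hypothetical helper names, not part of the code below:

```vba
Option Explicit

'Hypothetical helper: pages needed to show totalResults at perPage each (rounds up)
Public Function PagesNeeded(ByVal totalResults As Long, ByVal perPage As Long) As Long
    If totalResults <= 0 Or perPage <= 0 Then Exit Function   'returns 0
    PagesNeeded = (totalResults + perPage - 1) \ perPage      'integer ceiling division
End Function

'Hypothetical helper: clamp a requested range to 1..maxPage.
'Returns False when nothing is left to fetch.
Public Function ClampPageRange(ByRef firstPage As Long, ByRef lastPage As Long, _
                               ByVal maxPage As Long) As Boolean
    If firstPage < 1 Then firstPage = 1
    If lastPage > maxPage Then lastPage = maxPage
    ClampPageRange = (firstPage <= lastPage)
End Function
```

For example, 95 results at 30 per page need 4 pages, so a requested range of 2 to 5 would be clamped to 2 to 4 before any requests are sent.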


Code:

Option Explicit
Public Sub GetRestuarantInfo()
    Dim s As String, re As Object, p As String, page As Long, r As String, json As Object
    Const START_PAGE As Long = 2
    Const END_PAGE As Long = 4
    Const RESULTS_PER_PAGE As Long = 30

    p = "\[{""@context"".*?\]"
    Set re = CreateObject("VBScript.RegExp")

    Application.ScreenUpdating = False

    With CreateObject("MSXML2.XMLHTTP")

        For page = START_PAGE To END_PAGE
            .Open "GET", "https://www.yellowpages.com/atlanta-ga/restaurants?page=" & page, False
            .send
            If .Status = 200 Then
                s = .responseText
                r = GetValue(re, s, p)
                If r <> "Not Found" Then
                    Set json = JsonConverter.ParseJson(r)
                    WriteOutResults page, RESULTS_PER_PAGE, json
                End If
            End If
        Next
    End With
    Application.ScreenUpdating = True
End Sub
Public Sub WriteOutResults(ByVal page As Long, ByVal RESULTS_PER_PAGE As Long, ByVal json As Object)
    Dim sheetName As String, results(), r As Long, headers(), ws As Worksheet
    Dim item As Object
    ReDim results(1 To RESULTS_PER_PAGE, 1 To 3)

    sheetName = "page" & page
    headers = Array("Name", "Website", "Tel")
    If Not WorksheetExists(sheetName) Then
        Set ws = ThisWorkbook.Worksheets.Add
        ws.Name = sheetName
    Else
        'Reuse the existing sheet; ws must be set on this branch too
        Set ws = ThisWorkbook.Worksheets(sheetName)
        ws.Cells.ClearContents
    End If
    With ws
        For Each item In json  'collection of dictionaries, one per restaurant
            If r = RESULTS_PER_PAGE Then Exit For  'do not write past the array bounds
            r = r + 1
            results(r, 1) = item("name")
            results(r, 2) = item("url")
            results(r, 3) = item("telephone")
        Next
        'Write headers and all rows in one go
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        .Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Public Function GetValue(ByVal re As Object, inputString As String, ByVal pattern As String) As String
'https://regex101.com/r/M9oRON/1
'Returns the first match of pattern in inputString, or "Not Found" when there is none
    With re
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .pattern = pattern
        If .Test(inputString) Then
            GetValue = .Execute(inputString)(0)
        Else
            GetValue = "Not Found"  'must match the string the caller compares against
        End If
    End With
End Function

Public Function WorksheetExists(ByVal sName As String) As Boolean  '@Rory https://stackoverflow.com/a/28473714/6241235
    WorksheetExists = Evaluate("ISREF('" & sName & "'!A1)")
End Function

Helper function to return the number of pages

'VBE > Tools > References: Microsoft Internet Controls
Public Function GetNumberOfPages(ByVal RESULTS_PER_PAGE As Long) As Variant
    Dim ie As Object, totalResults As Long
    On Error GoTo errhand
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .Visible = False
        .Navigate2 "https://www.yellowpages.com/atlanta-ga/restaurants?page=1"

        While .Busy Or .readyState < 4: DoEvents: Wend

        With .document
            'Strip the surrounding text so that only the count remains
            totalResults = Replace$(Replace$(.querySelector(".pagination  p").innerText, "We found", vbNullString), "results", vbNullString)
        End With
        'Round up so that a partially filled last page is counted
        GetNumberOfPages = CLng(-Int(-totalResults / RESULTS_PER_PAGE))
        .Quit
    End With
    Exit Function
errhand:
    GetNumberOfPages = CVErr(xlErrNA)
    If Not ie Is Nothing Then ie.Quit  'do not leave a hidden browser running
End Function

Regex explanation:

Try it here; a regex101 link also appears in the comment at the top of GetValue.
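
Read piece by piece, the pattern \[{"@context".*?\] means: a literal [, then the literal text {"@context", then as few characters as possible (.*? is lazy), up to the first literal ]. Because the match stops at the first ], it relies on the JSON not containing a ] before the end of the array. A small sketch of the same pattern against a made-up string (the sample text and the sub name DemoJsonRegex are illustrative, not real page output):

```vba
Option Explicit

Public Sub DemoJsonRegex()
    Dim re As Object, matches As Object, sample As String
    'Made-up stand-in for the page HTML
    sample = "<script>[{""@context"":""http://schema.org"",""name"":""Sample Diner""}]</script>"

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "\[{""@context"".*?\]"   'same pattern as in the code above
    re.Global = False                     'only the first match is needed

    Set matches = re.Execute(sample)
    If matches.Count > 0 Then
        Debug.Print matches(0).Value      'the JSON array without the script tags
    Else
        Debug.Print "no match"
    End If
End Sub
```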


Answer 1 (score: 0)

The only way to do it with VBA is to check whether the "Next" button exists and, if it does, click it:


This is its HTML:

<a class="next ajax-page" href="/atlanta-ga/restaurants?page=2" data-page="2" data-analytics="{&quot;click_id&quot;:132}" data-remote="true" data-impressed="1">Next</a>

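Doing this with VBA is not rocket science. However, there are commercial RPA solutions that provide this functionality out of the box: UiPath, AutomationAnywhere, BluePrism. Python's Beautiful Soup would also do a rather good job.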
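
A rough sketch of that loop in VBA, under a few assumptions: the class name "next" is taken from the HTML above, MAX_PAGES and the sub name are made up, and the first element with that class is assumed to be the Next link. Since the link is marked data-remote="true", this sketch follows its href with Navigate2 rather than calling Click, so the usual wait loop still applies:

```vba
Option Explicit

Public Sub WalkPagesViaNextLink()
    Const MAX_PAGES As Long = 3   'assumed cap so the loop always ends
    Dim ie As Object, nextLinks As Object, pageNo As Long

    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = True
    ie.Navigate2 "https://www.yellowpages.com/atlanta-ga/restaurants"

    pageNo = 1
    Do
        While ie.Busy Or ie.readyState < 4: DoEvents: Wend

        'Scrape ie.document for the current page here
        Debug.Print "On page " & pageNo & ": " & ie.LocationURL

        If pageNo >= MAX_PAGES Then Exit Do
        Set nextLinks = ie.document.getElementsByClassName("next")
        If nextLinks.Length = 0 Then Exit Do   'no Next link: last page reached
        ie.Navigate2 nextLinks.Item(0).href    'follow the link instead of clicking it
        pageNo = pageNo + 1
    Loop
    ie.Quit
End Sub
```

The per-page scraping from the question would go where the Debug.Print line is, writing to a different sheet for each value of pageNo.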