Web scraping with Excel VBA (hyperlinked page problem)

Time: 2019-06-14 02:53:08

Tags: html excel vba web-scraping

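I created a macro that scrapes data from a local website, but the code does not work properly. The problem is that the macro only collects the table on that page, whereas what I want is the data behind each hyperlink in the "Name of VO/NGO" column.

[Image: the main table]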

This is the main table. The other fields I want come from the page that appears when an NGO name is clicked.

[Image: the page that appears after clicking an NGO name]

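I pieced the code together from online material, but I could not get it right. The first row of output should look like this, and the list should continue in the same way for every NGO: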

[Image: the expected output]

I am new to VBA development. Here is my code:

Option Explicit
Public Sub GetInfo()
    Const URL As String = "https://ngodarpan.gov.in/index.php/home/statewise_ngo/76/35/1"
    Dim html As HTMLDocument, hTable As HTMLTable, ws As Worksheet, headers()
    headers = Array("Sr No.", "Name of VO/NGO", "Address", "City","State","Telephone","Mobile No.","Website","Email")
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .Send
        html.body.innerHTML = .responseText
    End With
    Set hTable = html.querySelector("table.dvdtbl")
    Dim td As Object, tr As Object, r As Long, c As Long
    r = 1
    With ws
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        For Each tr In hTable.getElementsByTagName("tr")
            r = r + 1: c = 1
            If r > 3 Then
                For Each td In tr.getElementsByTagName("td")
                    .Cells(r - 2, c) = IIf(c = 2, "'" & td.innerText, td.innerText)
                    c = c + 1
                Next
            End If
        Next
    End With
End Sub

1 answer:

Answer 0 (score: 1)

To get the result you want, there are a few things to do.

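  1. You need to parse the ID number out of each link so that it can be reused in the POST request.
  2. You need to parse the CSRF token from the get_csrf endpoint (https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf) so that it can be sent with the POST request.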
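
The two parsing steps above can be sketched as small standalone helpers. This is only an illustration: the function names are hypothetical, the `onclick` format `show_ngo_info("…")` is inferred from the selector used in the script further down, and the `csrf_token` key in the sample string is a guess (only the value after the first colon is read).

```vba
Option Explicit

' Hypothetical helper: pulls the ID out of an onclick value such as
' show_ngo_info("12345"). Returns an empty string if the pattern is missing.
Public Function ExtractNgoId(ByVal onclickText As String) As String
    Dim startPos As Long, endPos As Long

    startPos = InStr(1, onclickText, "(""")
    If startPos = 0 Then Exit Function
    startPos = startPos + 2                        ' skip past the two characters ("
    endPos = InStr(startPos, onclickText, """)")
    If endPos = 0 Then Exit Function

    ExtractNgoId = Mid$(onclickText, startPos, endPos - startPos)
End Function

' Hypothetical helper: pulls the token out of a one-pair JSON body such as
' {"csrf_token":"abc123"}. The key name is assumed; it is never inspected.
Public Function ExtractCsrfToken(ByVal responseText As String) As String
    Dim colonPos As Long, token As String

    colonPos = InStr(1, responseText, ":")
    If colonPos = 0 Then Exit Function

    token = Mid$(responseText, colonPos + 1)       ' everything after the first colon
    token = Replace(token, """", vbNullString)     ' drop the quotes
    token = Replace(token, "}", vbNullString)      ' drop the closing brace
    ExtractCsrfToken = Trim$(token)
End Function

Public Sub DemoExtract()
    Debug.Print ExtractNgoId("show_ngo_info(""12345"")")         ' prints 12345
    Debug.Print ExtractCsrfToken("{""csrf_token"":""abc123""}")  ' prints abc123
End Sub
```

Returning an empty string instead of raising an error lets the caller skip a row whose markup does not match, which may be preferable if the site changes its HTML.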

Finally, you have to use a JSON converter or the script control to dig the individual fields out of the JSON response.

My attempt below fetches the JSON response for you. All that remains is to parse the JSON to suit your requirements:

Sub FetchTabularInfo()
    Dim Http As New XMLHTTP60, Html As New HTMLDocument
    Dim col As Variant, icol As New Collection
    Dim csrf As Variant, I&

    ' Load the listing page into an HTML document
    With Http
        .Open "GET", "https://ngodarpan.gov.in/index.php/home/statewise_ngo/76/35/1", False
        .send
        Html.body.innerHTML = .responseText
    End With

    ' Collect the ID passed to show_ngo_info("...") in each link's onclick attribute
    With Html.querySelectorAll(".table tr a[onclick^='show_ngo_info']")
        For I = 0 To .Length - 1
            icol.Add Split(Split(.item(I).getAttribute("onclick"), "(""")(1), """)")(0)
        Next I
    End With

    For Each col In icol
        ' Request a fresh CSRF token before every POST
        With Http
            .Open "GET", "https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf", False
            .send
            csrf = .responseText
        End With

        ' Keep only the token value: the text after the first colon, minus quotes and the closing brace
        csrf = Split(Replace(Split(csrf, ":")(1), """", ""), "}")(0)

        ' Post the ID together with the token to get the NGO details as JSON
        With Http
            .Open "POST", "https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info", False
            .setRequestHeader "X-Requested-With", "XMLHttpRequest"
            .setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
            .send "id=" & col & "&csrf_test_name=" & csrf
        End With

        Debug.Print Http.responseText
    Next col
End Sub

Output for the first link:

{"status":1,"infor":{"0":{"UniqueID":"AN\/2017\/0161456","Mobile":"9476076176","Email":"anaportblair@gmail.com","ngo_url":"http:\/\/www.adityanatyaacademy.com","ngo_name":"AdityaNatyaAcademy","pan_updDocId":"220156","reg_updDocId":"221361","Off_phone1":null,"Major_Activities1":".Drama\nJatrapala\nStreetplays\nAwareness Programe"},"issues_working_db":"","operational_states_db":"ANDAMAN & NICOBAR ISLANDS, ","operational_district_db":"ANDAMAN & NICOBAR ISLANDS->South Andaman, "},"member_info":[{"SalCode":null,"FName":"ASHUTOSH KARMAKAR","MName":null,"LName":null,"DesigName":"President","EmailId":"nicorajberg@gmail.com","MobileNo":"9434262953","pan_updDocId":"223392","aadhaar_updDocId":"223393"},{"SalCode":null,"FName":"KAVERI DEBSHARMA","MName":null,"LName":null,"DesigName":"Member","EmailId":"rajeshdebsharma@gmail.com","MobileNo":"9474299901","pan_updDocId":"223400","aadhaar_updDocId":"223401"},{"SalCode":null,"FName":"SATYAJIT BAIN","MName":null,"LName":null,"DesigName":"Asisstant Secretary","EmailId":"anaportblair@gmail.com","MobileNo":"9434271746","pan_updDocId":"223408","aadhaar_updDocId":"223409"}],"registeration_info":[{"nr_orgName":"AdityaNatyaAcademy","nr_add":"31 M.G. Road,\nOpp. Sun Sea Resort,\nMiddle Point.","nr_city":"Port Blair","StateName":"ANDAMAN & NICOBAR ISLANDS","reg_name":"Registrar of Companies","TypeDescription":"Registered Societies (Non-Government)","nr_regNo":"888","nr_updDocId":"0","nr_actName":"Society Registration Act 1860","nr_isFcra":"N","fcrano":"","ngo_reg_date":"05-12-1995"}],"source_info":[{"sourcefund":"S","deptt_name":"Directorate of Art and Culture","purpose":"To Promote Art and Culture in Andaman and Nicobar Islands.","datefrom":"2013-04-01","dateto":"2014-03-31","amount_sanctioned":"25000"},{"sourcefund":"S","deptt_name":"Directorate of Art and Culture","purpose":"To promote Art and Culture","datefrom":"2014-04-01","dateto":"2015-03-31","amount_sanctioned":"25000"},{"sourcefund":"S","deptt_name":"Directorate of Art and Culture","purpose":"To promote Art and Culture","datefrom":"2015-04-01","dateto":"2016-03-31","amount_sanctioned":"35000"},{"sourcefund":"S","deptt_name":"Directorate of Art and Culture","purpose":"To promote Art and Culture","datefrom":"2016-04-01","dateto":"2017-03-31","amount_sanctioned":"25000"}]}
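
One way to turn a response like this into a worksheet row is sketched below. It is not part of the original answer. It assumes the third-party VBA-JSON module (JsonConverter.bas) has been imported into the project along with a reference to Microsoft Scripting Runtime. The names `WriteNgoRow` and `TextOrEmpty` are hypothetical, and the mapping from JSON keys to the question's nine headers is a guess based on the sample output.

```vba
Option Explicit

' Assumes VBA-JSON (JsonConverter.bas) is imported into the project and
' "Microsoft Scripting Runtime" is referenced, as that module requires.
Public Sub WriteNgoRow(ByVal jsonText As String, ByVal ws As Worksheet, ByVal rowNum As Long)
    Dim json As Object, info As Object, reg As Object

    Set json = JsonConverter.ParseJson(jsonText)
    Set info = json("infor")("0")               ' JSON object stored under the string key "0"
    Set reg = json("registeration_info")(1)     ' JSON arrays become 1-based Collections

    With ws
        .Cells(rowNum, 1).Value = rowNum - 1    ' serial number, assuming headers sit in row 1
        .Cells(rowNum, 2).Value = TextOrEmpty(info("ngo_name"))
        .Cells(rowNum, 3).Value = TextOrEmpty(reg("nr_add"))
        .Cells(rowNum, 4).Value = TextOrEmpty(reg("nr_city"))
        .Cells(rowNum, 5).Value = TextOrEmpty(reg("StateName"))
        .Cells(rowNum, 6).Value = TextOrEmpty(info("Off_phone1"))
        .Cells(rowNum, 7).Value = TextOrEmpty(info("Mobile"))
        .Cells(rowNum, 8).Value = TextOrEmpty(info("ngo_url"))
        .Cells(rowNum, 9).Value = TextOrEmpty(info("Email"))
    End With
End Sub

' JSON null arrives as VBA Null; turn it into an empty string for the cell.
Private Function TextOrEmpty(ByVal v As Variant) As String
    If IsNull(v) Then
        TextOrEmpty = vbNullString
    Else
        TextOrEmpty = CStr(v)
    End If
End Function
```

In the loop above, the `Debug.Print Http.responseText` line could be swapped for a call such as `WriteNgoRow Http.responseText, ThisWorkbook.Worksheets("Sheet1"), r`, with `r` a row counter starting at 2. The sketch does not guard against an empty `registeration_info` array, so a record without registration details would raise an error at the `(1)` lookup.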

References to add in order to run the above script:

Microsoft HTML Object Library
Microsoft XML, v6.0