Scraping data from a website using VBA - Question

Time: 2020-10-08 02:05:16

Tags: vba ms-access web-scraping

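I am trying to get the address, facility type, and some other data for each facility in this search. I can get the search results and the list of facilities, but I can't figure out how to get the data from each page.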

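Edit: I applied the suggestions from the answer. This is the new code, and the "Object required" error now occurs on the Debug.Print line. I am trying to click each link and get the name, address, facility type, and any other data on that page.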

Sub Test()
    Dim ie2 As New InternetExplorer
    'Set ie = New InternetExplorerMedium

    With ie2
        .Visible = True
        .navigate "https://healthapps.state.nj.us/facilities/fsSetSearch.aspx?by=county"

        FacType = "Long-Term Care (Nursing Homes)"
        While .Busy Or .ReadyState < 4: DoEvents: Wend

        With .Document
            .querySelector("#middleContent_cbType_0").Click
            .querySelector("#middleContent_btnGetList").Click
        End With
            
        While .Busy Or .ReadyState < 4: DoEvents: Wend
            

        Pause (2)
        
        Dim list2 As Object, i2  As Long, line1 As String, line2 As String

        Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")
        
        For i2 = 0 To list2.Length - 1
            list2.Item(i2).Click
            Debug.Print .Document.querySelector(".infotable tr:nth-of-type(3) td + td").innerText
            
            While .Busy Or .ReadyState < 4: DoEvents: Wend

            Pause (2)

            address = Replace(Replace(Replace(line1 & " " & line2, "<span id=" & Chr(34) & "middleContent_lbAddress" & Chr(34) & ">", ""), "<br>", ", "), "</span>", "")

            WriteTable .Document.getElementsByTagName("table")(3), .Document.getElementById("middleContent_Menu1").innerText

            .Navigate2 .Document.URL
            While .Busy Or .ReadyState < 4: DoEvents: Wend
            Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")

        Next
        .Quit                                    '
    End With

End Sub

I get an "Object required" error on this line:

Address = Replace(Replace(Replace(.Document.getElementById("middleContent_lbAddress").outerHTML, "<span id=" & Chr(34) & "middleContent_lbAddress" & Chr(34) & ">", ""), "<br>", ", "), "</span>", "")

But I am fairly sure I am going about getting the data the wrong way, so even without the error I would not get what I need.

Sub Test()
    Dim ie2 As New InternetExplorer
    'Set ie = New InternetExplorerMedium

    With ie2
        .Visible = False
        .navigate "https://healthapps.state.nj.us/facilities/fsSetSearch.aspx?by=county"

        While .Busy Or .ReadyState < 4: DoEvents: Wend

        With .Document
            .querySelector("#middleContent_cbType_0").Click
            .querySelector("#middleContent_btnGetList").Click
        End With
            
        While .Busy Or .ReadyState < 4: DoEvents: Wend
            
        Dim list2 As Object, i2  As Long
        Set list2 = .Document.querySelectorAll("#main_table")
             
        For i2 = 0 To list2.Length - 1
            list2.Item(i2).Click

            While .Busy Or .ReadyState < 4: DoEvents: Wend

            Pause (2)
    
            If .Document.getElementById("middleContent_lbResultTitle") Is Nothing Then
                Pause (5)
            End If

            If .Document.getElementById("middleContent_lbResultTitle").outerHTML Like "*Long-Term Care Facility*" Then
                FacType = "Long-Term Care (Nursing Homes)"
            End If

            Address = Replace(Replace(Replace(.Document.getElementById("middleContent_lbAddress").outerHTML, "<span id=" & Chr(34) & "middleContent_lbAddress" & Chr(34) & ">", ""), "<br>", ", "), "</span>", "")

            WriteTable .Document.getElementsByTagName("table")(3), .Document.getElementById("middleContent_Menu1").innerText


            .Navigate2 .Document.URL
            While .Busy Or .ReadyState < 4: DoEvents: Wend
            Set list2 = .Document.querySelectorAll("#main_table")

        Next
        .Quit                                    '
    End With
End Sub
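
Both listings call Pause, which is not shown in the question. A minimal sketch of such a helper, assuming it is meant to wait a given number of seconds while keeping the host application responsive (the name and signature are taken from the call sites; the body is an assumption):

```vba
' Hypothetical helper: the question calls Pause (2) but does not show its body.
' Waits roughly the requested number of seconds without freezing the UI.
Public Sub Pause(ByVal seconds As Double)
    Dim endTime As Date

    ' A Date counts days, so convert seconds to a fraction of a day.
    endTime = Now + seconds / 86400#

    Do While Now < endTime
        DoEvents    ' let the browser and host application process events
    Loop
End Sub
```

Now has one-second resolution, so short pauses are only approximate.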

1 Answer:

Answer 0: (score: 1)

The #main_table selector in your code matches a single node. Instead, assuming all results use the same structure, try something like this (inside your existing With ie2 block):

Dim i As Long, line1 As String, line2 As String, address As String

Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")

For i = 0 To list2.Length - 1
    line1 = list2.Item(i).NextSibling.NextSibling.NodeValue
    line2 = list2.Item(i).NextSibling.NextSibling.NextSibling.NodeValue
    address = line1 & " " & line2 'apply string cleaning here
Next

This first targets the hyperlink of each result, then uses NextSibling to move between the br elements to pick up address lines 1 and 2. You will need to write some string cleaning for the address variable.
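
The string cleaning mentioned above is left open by the answer. One possible sketch, assuming the raw text may contain line breaks, tabs, non-breaking spaces, and runs of spaces (CleanAddress is a hypothetical name, not something from the original code):

```vba
' Hypothetical helper: normalises whitespace in text scraped from the page.
Public Function CleanAddress(ByVal raw As String) As String
    Dim s As String

    ' Turn line breaks, tabs, and non-breaking spaces into plain spaces.
    s = Replace(raw, vbCr, " ")
    s = Replace(s, vbLf, " ")
    s = Replace(s, vbTab, " ")
    s = Replace(s, ChrW(160), " ")

    ' Collapse any run of spaces down to a single space.
    Do While InStr(s, "  ") > 0
        s = Replace(s, "  ", " ")
    Loop

    CleanAddress = Trim$(s)
End Function
```

It could then be called as address = CleanAddress(line1 & " " & line2) inside the loop.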

If you decide to click each hyperlink instead, then on the details page use .Document.querySelector(".infotable tr:nth-of-type(3) td + td").innerText to retrieve the full address.

Example of navigating to each page (check that the URLs retrieved are complete and do not need a prefix):

Dim i As Long, address As String, urls(), numLinks As Long

Set list2 = .Document.querySelectorAll("[href*='fsFacilityDetails.aspx?item=']")
numLinks = list2.Length - 1
ReDim urls(0 To numLinks)

'collect every detail-page URL first, so the list is not lost on navigation
For i = 0 To numLinks
    urls(i) = list2.Item(i).href
Next

For i = 0 To numLinks
    .Navigate2 urls(i)
    While .Busy Or .ReadyState <> 4: DoEvents: Wend
    'time loop maybe goes here
    address = .Document.querySelector(".infotable tr:nth-of-type(3) td + td").innerText
    Debug.Print address
Next
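
For the "time loop" noted in the navigation example, one option is to poll for the target element with a timeout rather than pausing for a fixed time. The sketch below is an assumption about what such a loop could look like; WaitForElement is a hypothetical helper, and it expects the InternetExplorer object used in the question. Checking the result for Nothing before reading innerText may also avoid the "Object required" error when the element has not loaded yet.

```vba
' Hypothetical helper: polls the current page until a CSS selector matches
' or the timeout expires. Returns Nothing if the element never appears.
Public Function WaitForElement(ByVal ie As Object, ByVal selector As String, _
                               Optional ByVal timeoutSecs As Double = 10) As Object
    Dim started As Single, elapsed As Single
    Dim found As Object

    started = Timer
    Do
        DoEvents

        On Error Resume Next    ' the document may be unavailable mid-navigation
        Set found = ie.Document.querySelector(selector)
        On Error GoTo 0

        If Not found Is Nothing Then
            Set WaitForElement = found
            Exit Function
        End If

        elapsed = Timer - started
        If elapsed < 0 Then elapsed = elapsed + 86400   ' Timer resets at midnight
    Loop While elapsed < timeoutSecs
End Function
```

Usage inside the loop could be: Set cell = WaitForElement(ie2, ".infotable tr:nth-of-type(3) td + td"), followed by If Not cell Is Nothing Then Debug.Print cell.innerText, where cell is declared As Object.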