Web抓取网页的特定部分

时间:2019-11-14 19:35:33

标签: html excel vba web-scraping

我的网络抓取程序停止工作。所有者更改了html。

我认为需要更改的是Set allElements = doc.getElementsByClassName("el-col el-col-8")行。

我正在尝试从包含“ 52周范围(未定义)”部分的网页中获取文字。我设法从之前和之后抓取文本,但没有找到我需要的部分。一个示例网页是https://www.gurufocus.com/stock/gliba/summary,在我进行一些修整后,我的代码应在单元格中填入“ 38.72-73.63”。

我需要这样做,这样我才能弄清楚它,并在将来需要时进行更改,因此请专注于更正我的固定代码行(假设这是问题!),而不是全新的代码。更复杂的方法,因为它将超越我。 (我的另一组代码行完成了我想要的工作。)

Sub get_title_header()
Dim wb As Object
Dim doc As Object
Dim incomeStmtURLs As Variant
Dim sURL As String
Dim lastrow As Long
Dim allRowOfData As Object
Dim i As Integer
Dim allElements As IHTMLElementCollection
Dim anElement As IHTMLElement
Dim aCell As HTMLTableCell

Application.DisplayAlerts = False

Call ToggleEvents(False)

incomeStmtURLs = Range("Sheet1!h1:h2").Value

For i = 1 To UBound(incomeStmtURLs)

    Set wb = CreateObject("internetExplorer.Application")
    sURL = incomeStmtURLs(i, 1)

    wb.navigate sURL
    wb.Visible = False

    While wb.Busy
        Application.Wait Now + #12:00:01 AM#
        DoEvents
    Wend

    Set doc = wb.document

    On Error GoTo err_clear

    Set allElements = doc.getElementsByClassName("el-col el-col-8")
    While allElements.Length = 0
        Application.Wait Now + #12:00:01 AM#
        DoEvents
    Wend
    x = allElements(0).innerText
    ' Debug.Print x
    Sheet6.Cells(i + 1, 2).Value = Trim(Replace(Mid(x, InStr(1, x, "52-Week Range (undefined)") + 25, 25), vbLf, ""))

    Set allElements = doc.getElementsByClassName("fs-x-large fc-primary fw-bolder")

    x = allElements(0).innerText
    Sheet6.Cells(i + 1, 4).Value = Trim(Replace(Mid(x, InStr(1, x, "$") + 1, 7), vbLf, ""))

err_clear:
    If Err <> 0 Then
        Err.Clear
        Resume Next
    End If
    wb.Quit

Next i

Call ToggleEvents(True)

End Sub

Sub ToggleEvents(blnState As Boolean)
    Application.DisplayAlerts = blnState
    Application.EnableEvents = blnState
    If blnState Then Application.CutCopyMode = False
    If blnState Then Application.StatusBar = False
End Sub

1 个答案:

答案 0 :(得分:1)

页面随着向下滚动而动态更新内容。您可能需要将页面的该部分滚动到视图中,然后使用获取类名称为statictics-item的所有元素,然后使用n-2索引,例如没有滚动部分:

Set elems = ie.document.getElementsByClassName("statictics-item")
If elems.length > 1 Then Debug.print elems(elems.length-2).innerText

对于将来的读者(我知道OP不需要这样做):

我会避免整个滚动泡菜,动态html和浏览器,并发出xmlhttp请求,并正则表达式从网页用于更新的javscript对象中提取适当的值。 N.B.我可能会添加对正则表达式匹配位置的验证。

Public Sub test()
    Dim r As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.gurufocus.com/stock/gliba/summary", False
        .send
        r = GetMatches(.responseText, "price52wlow:(.*?),|price52whigh:(.*?),")
        If r <> "NA" Then MsgBox r
    End With
End Sub

Public Function GetMatches(ByVal inputString As String, ByVal sPattern As String) As String
    Dim matches As Object

    With CreateObject("vbscript.regexp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = sPattern

        'If .test(inputString) Then
        Set matches = .Execute(inputString)

        If matches.Count = 2 Then
            GetMatches = matches.Item(0).submatches(0) & "-" & matches.Item(1).submatches(1)
        Else
            GetMatches = "NA"
        End If
    End With
End Function

正则表达式:

enter image description here