Excel宏将网站中的线程注释绘制到单元格中

时间:2016-06-22 16:19:46

标签: excel excel-vba reddit vba

我正在尝试将Reddit线程注释存储在excel电子表格中,但是我在尝试弄清楚如何执行此操作时遇到了麻烦。我没有太多使用宏从网页获取数据的经验,所以我发现很难弄清楚如何从指定的Reddit线程中抽出每个注释并将其放在一个单元格中,以及它是否是可能的事。

这是我到目前为止所做的:

Sub getRedditData()

Dim x As Long, y As Long
Dim htm As Object

Set htm = CreateObject("htmlFile")

With CreateObject("msxml2.xmlhttp")
    .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
    .send
    htm.body.innerhtml = .responsetext
End With

With htm.getelementbyid("comments")
    Set cellrangex = .Rows(x).Cells.Length - 1
    Set cellrangey = .Rows(x).Cells.Length - 1
    Set cellrange1 = Sheets(1).Cells(x + 1, y + 1).Value
    Set cellrange2 = .Rows(x).Cells(y).innertext

    For x = 0 To cellrangex
        For y = 0 To cellrangey
            cellrange = cellrange2
        Next y
    Next x
End With


End Sub

1 个答案:

答案 0 :(得分:1)

您真的需要使用体面的HTML编辑器分析您正在抓取的网页内容。我建议用chrome导航到相关页面并使用F12打开它的开发者工具。在"元素"选项卡,您可以快速查看哪个HTML正在生成页面的哪个部分(打开页面和开发人员工具彼此相邻)。

当您进入评论时,您会注意到每条评论的文字都位于<p>标记内,并且每个<p>标记位于<div>内。我们正在寻找模式,所以这是一个良好的开端。

您还会注意到,这些<div>代码中的每一个都有class md。 所以...让我们将所有页面<div>标记加载到一个对象中,然后查找那些className包含"md"的标记:

Sub getRedditData()

Dim x As Long, y As Long
Dim htm As Object

Set htm = CreateObject("htmlFile")

With CreateObject("msxml2.xmlhttp")
    .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
    .send
    htm.body.innerhtml = .responsetext
End With

Set Divelements = htm.getElementsByTagName("div")

For Each DivElement In Divelements
    If InStr(1, DivElement.ClassName, "md") Then
        'print contents to the Immediate window for debugging View>>Immediate Window to insure it's up in your VBE
        Debug.Print DivElement.InnerText
    End If
Next

End Sub

这样你就会看到所有评论都停留在立即窗口中(转到视图&gt;&gt;立即窗口),这样你就可以看到这个调试输出。

跳过节点后,您可以向上浏览几个元素并返回树以获取用户名:

Sub getRedditData()

    Dim x As Long, y As Long
    Dim htm As Object

    Set htm = CreateObject("htmlFile")

    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
        .send
        htm.body.innerhtml = .responsetext
    End With

    Set Divelements = htm.getElementsByTagName("div")


    On Error Resume Next

    For Each divElement In Divelements
        If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
            Set commentEntry = divElement.ParentNode.ParentNode.ParentNode

            'Print the name and the comment
            Debug.Print commentEntry.FirstChild.FirstChild.NextSibling.InnerText & ":", divElement.InnerText

        End If
    Next

End Sub

要将其打印到工作表,只需指向单元格而不是debug.print即时窗口。类似的东西:

Sub getRedditData()

    Dim x As Long, y As Long
    Dim htm As Object
    Dim ws As Worksheet, wsCell As Integer

    'set the worksheet to print to and the first row to start printing.
    Set ws = Sheets("Sheet1")
    wsCell = 1

    Set htm = CreateObject("htmlFile")

    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
        .send
        htm.body.innerhtml = .responsetext
    End With

    Set Divelements = htm.getElementsByTagName("div")


    On Error Resume Next

    For Each divElement In Divelements
        If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
            Set commentEntry = divElement.ParentNode.ParentNode.ParentNode

            'Print the name and the comment to ws sheet columns 1 and 2
            ws.Cells(wsCell, 1).Value = commentEntry.FirstChild.FirstChild.NextSibling.InnerText
            ws.Cells(wsCell, 2).Value = divElement.InnerText

            'iterate to the next row
            wsCell = wsCell + 1

        End If
    Next
End Sub