我正在尝试将Reddit线程注释存储在excel电子表格中,但是我在尝试弄清楚如何执行此操作时遇到了麻烦。我没有太多使用宏从网页获取数据的经验,所以我发现很难弄清楚如何从指定的Reddit线程中抽出每个注释并将其放在一个单元格中,以及它是否是可能的事。
这是我到目前为止所做的:
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
With htm.getelementbyid("comments")
Set cellrangex = .Rows(x).Cells.Length - 1
Set cellrangey = .Rows(x).Cells.Length - 1
Set cellrange1 = Sheets(1).Cells(x + 1, y + 1).Value
Set cellrange2 = .Rows(x).Cells(y).innertext
For x = 0 To cellrangex
For y = 0 To cellrangey
cellrange = cellrange2
Next y
Next x
End With
End Sub
答案 0 :(得分:1)
您真的需要使用体面的HTML编辑器分析您正在抓取的网页内容。我建议用chrome导航到相关页面并使用F12打开它的开发者工具。在"元素"选项卡,您可以快速查看哪个HTML正在生成页面的哪个部分(打开页面和开发人员工具彼此相邻)。
当您进入评论时,您会注意到每条评论的文字都位于<p>
标记内,并且每个<p>
标记位于<div>
内。我们正在寻找模式,所以这是一个良好的开端。
您还会注意到,这些<div>
代码中的每一个都有class
md
。
所以...让我们将所有页面<div>
标记加载到一个对象中,然后查找那些className
包含"md"
的标记:
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
Set Divelements = htm.getElementsByTagName("div")
For Each DivElement In Divelements
If InStr(1, DivElement.ClassName, "md") Then
'print contents to the Immediate window for debugging View>>Immediate Window to insure it's up in your VBE
Debug.Print DivElement.InnerText
End If
Next
End Sub
这样你就会看到所有评论都停留在立即窗口中(转到视图&gt;&gt;立即窗口),这样你就可以看到这个调试输出。
跳过节点后,您可以向上浏览几个元素并返回树以获取用户名:
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
Set Divelements = htm.getElementsByTagName("div")
On Error Resume Next
For Each divElement In Divelements
If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
Set commentEntry = divElement.ParentNode.ParentNode.ParentNode
'Print the name and the comment
Debug.Print commentEntry.FirstChild.FirstChild.NextSibling.InnerText & ":", divElement.InnerText
End If
Next
End Sub
要将其打印到工作表,只需指向单元格而不是debug.print
即时窗口。类似的东西:
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Dim ws As Worksheet, wsCell As Integer
'set the worksheet to print to and the first row to start printing.
Set ws = Sheets("Sheet1")
wsCell = 1
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
Set Divelements = htm.getElementsByTagName("div")
On Error Resume Next
For Each divElement In Divelements
If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
Set commentEntry = divElement.ParentNode.ParentNode.ParentNode
'Print the name and the comment to ws sheet columns 1 and 2
ws.Cells(wsCell, 1).Value = commentEntry.FirstChild.FirstChild.NextSibling.InnerText
ws.Cells(wsCell, 2).Value = divElement.InnerText
'iterate to the next row
wsCell = wsCell + 1
End If
Next
End Sub