Scraping a website using VBA

Time: 2018-06-11 21:34:14

Tags: vba excel-vba web-scraping excel

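I am trying to scrape some data from a website. I have stepped through the code and, for some reason, Question.className does not match the value I compare it against, which is the value I see in the browser's inspector. By "does not match" I mean that nothing seems to be assigned. I have only just started looking into data scraping and would appreciate any tips. I have not included all of the code, because the code below is where the error occurs. Thanks in advance, and apologies if this has been asked before; I could not find anything after searching, so I thought I would post.

The URL is https://stackoverflow.com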

Range("A3").Value = "Question id" 'put heading across the top of row 3
Range("B3").Value = "Votes"
Range("C3").Value = "Views"
Range("D3").Value = "Person"

Dim QuestionList As IHTMLElement
Dim Questions As IHTMLElementCollection
Dim Question As IHTMLElement
Dim RowNumber As Long
Dim QuestionId As String
Dim QuestionFields As IHTMLElementCollection
Dim QuestionField As IHTMLElement
Dim votes As String
Dim views As String
Dim QuestionFieldLinks As IHTMLElementCollection

Set QuestionList = html.getElementById("question-mini-list")
Set Questions = QuestionList.Children

RowNumber = 4

For Each Question In Questions
    'if this is the tag containing the question details, process it
    If Question.className = "question-summary narrow" Then
        'first get and store the question id in first column
        QuestionId = Replace(Question.ID, "question-summary-", "")
        Cells(RowNumber, 1).Value = CLng(QuestionId)

        'get a list of all of the parts of this question, and loop over them
        Set QuestionFields = Question.all

        For Each QuestionField In QuestionFields
            'if this is the question's votes, store it (get rid of any surrounding text)
            If QuestionField.className = "votes" Then
                votes = Replace(QuestionField.innerText, "votes", "")
                votes = Replace(votes, "vote", "")
                Cells(RowNumber, 2).Value = Trim(votes)
            End If

            'likewise for views (getting rid of any text)
            If QuestionField.className = "views" Then
                views = QuestionField.innerText
                views = Replace(views, "views", "")
                views = Replace(views, "view", "")
                Cells(RowNumber, 3).Value = Trim(views)
            End If

            'if this is the bit where author's name is ...
            If QuestionField.className = "started" Then
                'get a list of all elements within, and store the text in the second one
                Set QuestionFieldLinks = QuestionField.all
                Cells(RowNumber, 4).Value = QuestionFieldLinks(2).innerHTML
            End If
        Next QuestionField

        'go on to next row of worksheet
        RowNumber = RowNumber + 1
    End If
Next Question

1 answer:

Answer 0 (score: 2)

Try this. It should fetch the required fields:

Sub GetInformation()
    Dim Http As New XMLHTTP60, Html As New HTMLDocument
    Dim post As HTMLDivElement, URL$, R&

    URL = "https://stackoverflow.com/"

    'request the page synchronously and load the response text into the HTML document
    With Http
        .Open "GET", URL, False
        .send
        Html.body.innerHTML = .responseText
    End With

    'one worksheet row per question summary: id, votes, views
    For Each post In Html.getElementsByClassName("question-summary")
        R = R + 1: Cells(R, 1) = Split(post.getAttribute("id"), "-")(2)
        Cells(R, 2) = Split(post.querySelector(".votes span").getAttribute("title"), " ")(0)
        Cells(R, 3) = Split(post.querySelector(".views span").getAttribute("title"), " ")(0)
    Next post
End Sub
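
The request above assumes the page comes back successfully. As a minimal sketch of the same fetch with a status check, the function below raises an error when the server does not answer with HTTP 200. `FetchHtml` is a hypothetical name that does not appear in the original answer, the error number is an arbitrary choice, and the code assumes the same two library references as the answer.

```vba
'Hypothetical helper (not part of the original answer).
'Assumes references to Microsoft XML, v6.0 and Microsoft HTML Object Library.
Function FetchHtml(ByVal URL As String) As HTMLDocument
    Dim Http As New XMLHTTP60
    Dim Doc As New HTMLDocument

    'synchronous GET, as in the answer
    Http.Open "GET", URL, False
    Http.send

    'stop early instead of parsing an error page
    If Http.Status <> 200 Then
        Err.Raise vbObjectError + 513, "FetchHtml", _
                  "Request failed with HTTP status " & Http.Status
    End If

    Doc.body.innerHTML = Http.responseText
    Set FetchHtml = Doc
End Function
```

It could be called as `Set Html = FetchHtml("https://stackoverflow.com/")` before the `For Each post` loop.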

References to add to the library:

Microsoft XML, v6.0
Microsoft HTML Object Library
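
The answer selects elements by a single class token (`question-summary`), whereas the loop in the question compares the whole `className` string, which fails whenever the element carries a different or additional set of classes. Whether that is the actual cause of the mismatch is an assumption. A minimal sketch of a token-based check that could be used in the original loop follows; `HasClass` is a hypothetical helper name, and the code assumes the classes are separated by single spaces.

```vba
'Hypothetical helper (not from the original post): True when the
'space-separated class list contains the given class token.
Function HasClass(ByVal ClassList As String, ByVal Token As String) As Boolean
    'pad both sides with spaces so that only whole tokens match
    HasClass = InStr(1, " " & ClassList & " ", " " & Token & " ", vbBinaryCompare) > 0
End Function
```

In the question's loop, the test would then read `If HasClass(Question.className, "question-summary") Then` instead of the exact string comparison.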