我一直在玩使用VBS / VBA从网页上抓取数据。
如果它是Javascript我就会轻而易举,但它在VBS / VBA中似乎并不那么直接。
这是我为答案做的一个例子,它有效,但我计划使用getElementByTagName
访问子节点,但我无法弄清楚如何使用它们! HTMLElement
对象没有这些方法。
Sub Scrape()
Dim Browser As InternetExplorer
Dim Document As HTMLDocument
Dim Elements As IHTMLElementCollection
Dim Element As IHTMLElement
Set Browser = New InternetExplorer
Browser.navigate "http://www.hsbc.com/about-hsbc/leadership"
Do While Browser.Busy And Not Browser.readyState = READYSTATE_COMPLETE
DoEvents
Loop
Set Document = Browser.Document
Set Elements = Document.getElementsByClassName("profile-col1")
For Each Element in Elements
Debug.Print "[ name] " & Trim(Element.Children(1).Children(0).innerText)
Debug.Print "[ title] " & Trim(Element.Children(1).Children(1).innerText)
Next Element
Set Document = Nothing
Set Browser = Nothing
End Sub
我一直在查看HTMLElement.document
属性,看它是不是文档的一个片段,但要么难以使用,要么就是我的想法
Dim Fragment As HTMLDocument
Set Element = Document.getElementById("example") ' This works
Set Fragment = Element.document ' This doesn't
这似乎是一个漫长的方式(虽然这通常是vba imo的方式)。 任何人都知道是否有更简单的方法来链接函数?
Document.getElementById("target").getElementsByTagName("tr")
会很棒......
答案 0 :(得分:12)
Sub Scrape()
Dim Browser As InternetExplorer
Dim Document As htmlDocument
Dim Elements As IHTMLElementCollection
Dim Element As IHTMLElement
Set Browser = New InternetExplorer
Browser.Visible = True
Browser.navigate "http://www.stackoverflow.com"
Do While Browser.Busy And Not Browser.readyState = READYSTATE_COMPLETE
DoEvents
Loop
Set Document = Browser.Document
Set Elements = Document.getElementById("hmenus").getElementsByTagName("li")
For Each Element In Elements
Debug.Print Element.innerText
'Questions
'Tags
'Users
'Badges
'Unanswered
'Ask Question
Next Element
Set Document = Nothing
Set Browser = Nothing
End Sub
答案 1 :(得分:4)
我也不喜欢它。
所以使用javascript:
Public Function GetJavaScriptResult(doc as HTMLDocument, jsString As String) As String
Dim el As IHTMLElement
Dim nd As HTMLDOMTextNode
Set el = doc.createElement("INPUT")
Do
el.ID = GenerateRandomAlphaString(100)
Loop Until Document.getElementById(el.ID) Is Nothing
el.Style.display = "none"
Set nd = Document.appendChild(el)
doc.parentWindow.ExecScript "document.getElementById('" & el.ID & "').value = " & jsString
GetJavaScriptResult = Document.getElementById(el.ID).Value
Document.removeChild nd
End Function
Function GenerateRandomAlphaString(Length As Long) As String
Dim i As Long
Dim Result As String
Randomize Timer
For i = 1 To Length
Result = Result & Chr(Int(Rnd(Timer) * 26 + 65 + Round(Rnd(Timer)) * 32))
Next i
GenerateRandomAlphaString = Result
End Function
如果您对此有任何问题,请告诉我。我已将上下文从方法更改为函数。
顺便问一下,您使用的是哪个版本的IE?我怀疑你在< IE8。如果您升级到IE8,我认为它会将shdocvw.dll更新为ieframe.dll,您将能够使用document.querySelector / All。
修改强>
评论回复并非真正的评论: 基本上,在VBA中执行此操作的方法是遍历子节点。问题是你没有得到正确的返回类型。您可以通过创建自己的类(单独)实现IHTMLElement和IHTMLElementCollection来解决这个问题。但是,如果没有得到报酬,这对我来说太痛苦了:)。如果您已确定,请继续阅读VB6 / VBA的Implements关键字。
Public Function getSubElementsByTagName(el As IHTMLElement, tagname As String) As Collection
Dim descendants As New Collection
Dim results As New Collection
Dim i As Long
getDescendants el, descendants
For i = 1 To descendants.Count
If descendants(i).tagname = tagname Then
results.Add descendants(i)
End If
Next i
getSubElementsByTagName = results
End Function
Public Function getDescendants(nd As IHTMLElement, ByRef descendants As Collection)
Dim i As Long
descendants.Add nd
For i = 1 To nd.Children.Length
getDescendants nd.Children.Item(i), descendants
Next i
End Function
答案 2 :(得分:1)
我将使用XMLHTTP请求来更快地检索页面内容。然后,使用querySelectorAll来应用CSS类选择器以按类名称进行抓取就足够容易了。然后,您可以通过标签名称和索引访问子元素。
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, html As HTMLDocument, elements As Object, i As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.hsbc.com/about-hsbc/leadership", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Set html = New HTMLDocument
With html
.body.innerHTML = sResponse
Set elements = .querySelectorAll(".profile-col1")
For i = 0 To elements.Length - 1
Debug.Print String(20, Chr$(61))
Debug.Print elements.item(i).getElementsByTagName("a")(0).innerText
Debug.Print elements.item(i).getElementsByTagName("p")(0).innerText
Debug.Print elements.item(i).getElementsByTagName("p")(1).innerText
Next
End With
End Sub
参考:
VBE>工具>参考> Microsoft HTML对象库
答案 3 :(得分:0)
感谢上面给出了Scrape()子程序的答案。代码完全按照书面形式工作,然后我能够将代码转换为与我试图抓取的特定网站一起工作。
我没有足够的声誉来赞成或评论,但我确实有一些小的改进,以添加到dee的答案:
您需要通过“Tools \ References”将VBA引用添加到“Microsoft HTML Object Library中,以便编译代码。
我注释掉了Browser.Visible行并添加了注释如下
'if you need to debug the browser page, uncomment this line:
'Browser.Visible = True
我在Set Browser = Nothing:
之前添加了一行来关闭浏览器Browser.Quit
再次感谢dee!
ETA:这适用于使用IE9的计算机,但不适用于使用IE8的计算机。任何人都有修复?
自己找到了解决方案,所以回到这里发布它。 ClassName函数在IE9中可用。为了在IE8中工作,你使用querySelectorAll,在你要查找的对象的类名前面有一个点:
'Set repList = doc.getElementsByClassName("reportList") 'only works in IE9, not in IE8
Set repList = doc.querySelectorAll(".reportList") 'this works in IE8+