我没有在工作的任何计算机上安装Internet Explorer,因此创建Internet Explorer的对象并使用ie.navigate解析html并搜索标签是不可能的。我的问题是,如何在不使用IE的情况下从帧源自动将标签中的某些数据拉到我的电子表格?答案中的代码示例非常有用:)谢谢
答案 0 :(得分:5)
您可以使用 XMLHTTP 来检索网页的HTML源代码:
Function GetHTML(url As String) As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", url, False
.Send
GetHTML = .ResponseText
End With
End Function
我不建议将其用作工作表函数,否则每次工作表重新计算时都会重新查询网站URL。有些网站有逻辑可以通过频繁的重复呼叫来检测抓取,并且您的IP可能暂时或永久地禁止,具体取决于网站。
获得源HTML字符串(最好存储在变量中以避免不必要的重复调用),您可以使用基本文本函数来解析字符串以搜索您的标记。
此基本功能将返回 <tag>
和 </tag>
之间的值:
Public Function getTag(url As String, tag As String, Optional occurNum As Integer) As String
Dim html As String, pStart As Long, pEnd As Long, o As Integer
html = GetHTML(url)
'remove <> if they exist so we can add our own
If Left(tag, 1) = "<" And Right(tag, 1) = ">" Then
tag = Left(Right(tag, Len(tag) - 1), Len(Right(tag, Len(tag) - 1)) - 1)
End If
' default to Occurrence #1
If occurNum = 0 Then occurNum = 1
pEnd = 1
For o = 1 To occurNum
' find start <tag> beginning at 1 (or after previous Occurence)
pStart = InStr(pEnd, html, "<" & tag & ">", vbTextCompare)
If pStart = 0 Then
getTag = "{Not Found}"
Exit Function
End If
pStart = pStart + Len("<" & tag & ">")
' find first end </tag> after start <tag>
pEnd = InStr(pStart, html, "</" & tag & ">", vbTextCompare)
Next o
'return string between start <tag> & end </tag>
getTag = Mid(html, pStart, pEnd - pStart)
End Function
这只会找到基本的 <tag>
,但您可以添加/删除/更改文字功能以满足您的需求。
Sub findTagExample()
Const testURL = "https://en.wikipedia.org/wiki/Web_scraping"
'search for 2nd occurence of tag: <h2> which is "Contents" :
Debug.Print getTag(testURL, "<h2>", 2)
'...this returns the 8th occurence, "Navigation Menu" :
Debug.Print getTag(testURL, "<h2>", 8)
'...and this returns an HTML <span> containing a title for the 'Legal Issues' section:
Debug.Print getTag("https://en.wikipedia.org/wiki/Web_scraping", "<h2>", 4)
End Sub
答案 1 :(得分:0)
任何进行了网络抓取的人都会熟悉创建Internet Explorer(IE)实例并导航到网址,然后在页面准备就绪后开始使用&#39; Microsoft HTML对象导航DOM图书馆&#39; (MSHTML)类型库。该问题询问IE是否不可用。对于运行Windows 10的盒子,情况也一样。
我曾怀疑可以独立于IE创建一个MSHTML.HTMLDocument实例,但它的创建并不明显。感谢提问者现在问这个问题。答案在于MSHTML.IHTMLDocument4.createDocumentFromUrl方法。一个人需要一个本地文件才能工作(编辑:实际上也可以放一个webby url!)但是我们有一个很好的整洁的Windows API函数叫URLDownloadToFile来下载文件。
此代码在我运行Microsoft Edge的Windows 10机器上运行,而不是Internet Explorer。这是一个重要的发现,并感谢提问者。
Option Explicit
'* Tools->Refernces Microsoft HTML Object Library
'* MSDN - URLDownloadToFile function - https://msdn.microsoft.com/en-us/library/ms775123(v=vs.85).aspx
Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" _
(ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, _
ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
Sub Test()
Dim fso As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Dim sLocalFilename As String
sLocalFilename = Environ$("TMP") & "\urlmon.html"
Dim sURL As String
sURL = "https://stackoverflow.com/users/3607273/s-meaden"
Dim bOk As Boolean
bOk = (URLDownloadToFile(0, sURL, sLocalFilename, 0, 0) = 0)
If bOk Then
If fso.FileExists(sLocalFilename) Then
'* Tools->Refernces Microsoft HTML Object Library
Dim oHtml4 As MSHTML.IHTMLDocument4
Set oHtml4 = New MSHTML.HTMLDocument
Dim oHtml As MSHTML.HTMLDocument
Set oHtml = Nothing
'* IHTMLDocument4.createDocumentFromUrl
'* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
Set oHtml = oHtml4.createDocumentFromUrl(sLocalFilename, "")
'* need to wait a little whilst the document parses
'* because it is multithreaded
While oHtml.readyState <> "complete"
DoEvents '* do not comment this out it is required to break into the code if in infinite loop
Wend
Debug.Assert oHtml.readyState = "complete"
Dim sTest As String
sTest = Left$(oHtml.body.outerHTML, 100)
Debug.Assert Len(Trim(sTest)) > 50 '* just testing we got a substantial block of text, feel free to delete
'* page specific logic goes here
Dim htmlAnswers As Object 'MSHTML.DispHTMLElementCollection
Set htmlAnswers = oHtml.getElementsByClassName("answer-hyperlink")
Dim lAnswerLoop As Long
For lAnswerLoop = 0 To htmlAnswers.Length - 1
Dim vAnswerLoop
Set vAnswerLoop = htmlAnswers.Item(lAnswerLoop)
Debug.Print vAnswerLoop.outerText
Next
End If
End If
End Sub
感谢您提出这个问题。
P.S。我已经使用TaskList来验证在此代码运行时是否在引擎盖下创建了IExplore.exe。
P.P.S如果您喜欢这个,请在我的Excel Development Platform blog
上查看更多内容