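I have a VBA module that extracts all links from a page. However, I want to ignore every link inside certain tags, such as <header> and <footer> (and all of their child tags). Can anyone tell me how to do this?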
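Sub Fetch_click()
    Dim IE As Object
    Dim LinkArr As Object
    Dim LinkObj As Object
    Dim i As Long

    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True
    IE.Navigate Cells(1, 1).Text
    ' Wait until the page has finished loading (4 = READYSTATE_COMPLETE)
    Do While IE.Busy Or IE.readyState <> 4
        DoEvents
    Loop

    ' Write the href of every <a> to column A, starting at row 3
    i = 3
    Set LinkArr = IE.Document.getElementsByTagName("a")
    For Each LinkObj In LinkArr
        Cells(i, 1).Value = LinkObj.href
        i = i + 1
    Next LinkObj
End Sub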
Thanks
Answer 0 (score: 2)
I prefer to use the objects from the Microsoft HTML Object Library and the Microsoft Internet Controls library (add references to both!), e.g.
Sub StartTest()
    Dim Browser As SHDocVw.InternetExplorer
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim ECol As MSHTML.IHTMLElementCollection
    Dim IFld As MSHTML.IHTMLElement

    ' start browser
    Set Browser = New SHDocVw.InternetExplorer
    Browser.Visible = True
    Browser.navigate "www.dauda.at"
    ' wait for the page to load before touching the document
    Do While Browser.Busy Or Browser.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop
    Set HTMLDoc = Browser.document

    ' search all <a> tags
    Set ECol = HTMLDoc.getElementsByTagName("a")
    For Each IFld In ECol
        ' etc ...
    Next IFld

    ' clean up
    Set IFld = Nothing
    Set ECol = Nothing
    Set HTMLDoc = Nothing
    Browser.Quit
    Set Browser = Nothing
End Sub
Checking where an <a> tag sits is as simple as examining IFld.ParentNode.nodeName to get the tag of the enclosing parent.
If it is unclear how deeply your <a> is nested, you can use a recursive function that examines the next-higher parent, all the way up to the document root ("#document") or the containing "HTML", e.g.
Function BadParentRec(TestFld As MSHTML.IHTMLElement) As Boolean
    Dim MyTag As String, MyTempResult As Boolean
    BadParentRec = False
    MyTag = TestFld.ParentNode.nodeName
    ' Debug.Print MyTag
    If MyTag = "#document" Then
        MyTempResult = False ' reached the document root without a bad tag
    ElseIf MyTag = "XXX" Then ' your own criteria for bad tags go here (nodeName is upper case, e.g. "HEADER")
        MyTempResult = True ' send "bad" back up the recursion chain
    Else
        MyTempResult = BadParentRec(TestFld.parentElement) ' next level up
    End If
    BadParentRec = MyTempResult
End Function
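The "XXX" placeholder above stands for your own criteria. For the tags named in the question, one possible way to express them is a small helper like the sketch below. IsBadTag is a hypothetical name, not part of the original answer, and the tag list is an assumption taken from the question.

```vba
' Hypothetical helper, not part of the original answer.
' Returns True when TagName is one of the tags whose links should be skipped.
Function IsBadTag(ByVal TagName As String) As Boolean
    ' Compare in upper case, since nodeName is upper case for HTML elements
    Select Case UCase$(TagName)
        Case "HEADER", "FOOTER"
            IsBadTag = True
        Case Else
            IsBadTag = False
    End Select
End Function
```

With such a helper, the placeholder test in BadParentRec could read ElseIf IsBadTag(MyTag) Then.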
... so inside your For Each loop you would call BadParentRec(IFld) and skip any link for which it returns True.
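The loop body itself is missing from the post, so the following is only a sketch of how the pieces might fit together, not the answerer's original code. It assumes references to both libraries mentioned above, the BadParentRec function with its "XXX" test adapted to your tags, and the sheet layout from the question (URL in A1, links written from row 3). FetchFilteredLinks is a hypothetical name.

```vba
' Sketch only: relies on BadParentRec from the answer above.
Sub FetchFilteredLinks() ' hypothetical name
    Dim Browser As SHDocVw.InternetExplorer
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim ECol As MSHTML.IHTMLElementCollection
    Dim IFld As MSHTML.IHTMLElement
    Dim i As Long

    Set Browser = New SHDocVw.InternetExplorer
    Browser.Visible = True
    Browser.navigate Cells(1, 1).Text
    ' wait for the page to finish loading
    Do While Browser.Busy Or Browser.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop
    Set HTMLDoc = Browser.document

    i = 3
    Set ECol = HTMLDoc.getElementsByTagName("a")
    For Each IFld In ECol
        ' keep the link only if no enclosing parent is a "bad" tag
        If Not BadParentRec(IFld) Then
            ' getAttribute is used because IFld is typed as IHTMLElement;
            ' its result may differ from the .href property for relative links
            Cells(i, 1).Value = IFld.getAttribute("href")
            i = i + 1
        End If
    Next IFld

    Browser.Quit
    Set Browser = Nothing
End Sub
```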