经过大量搜索后,我正在努力使用VBA从下面的HTML中删除数据。具体来说,我试图提取数据' DATA ONE'和'数据三'来自每个班级=" _Xnb _QJ"在下面的HTML代码中:
<div class="results">
<div class="_s2 _wPc">
<div class="_fW _QJ">
<div class="_Xnb _QJ _Z9b">
<div class="_Xnb _QJ">
<div class="_Xnb _QJ">
<div class="_Xnb _QJ">
<a href="//Extracted URL//">
<span class="_fbb">
<img id="uid_3" //Extracted// >
</span>
<span class="_PHb">
<span class="_MHb">DATA ONE</span>
</span>
<span class="_B6e">
<span class="_x2">DATA TWO</span>
<span class="_Fs"> DATA THREE </span>
我一直在尝试使用getElementsByClassName来获取&#34; _Xnb _QJ&#34;的集合。类,并为每个类使用getElementsByTagName来搜索&#34; _MHb&#34;和&#34; _FS&#34;。我无法按数字顺序挑选孩子,因为这会在&#34; _Xnb ..&#34;之间发生变化。类,但我需要的数据总是附加相同的(_MHb / FS)类标记。
我是VBA / HTML的新手,所以这段代码主要是通过在stackoverflow上的其他地方编辑示例来组装的。我想知道我需要的课程是否属于&#34; href&#34;而不是直接在_Xnb类下面是我无法提取正确数据的原因?
下面我的VBA代码的相关部分 - 当我运行它时,代码似乎运行正常但没有收集数据。
Dim RowNumber As Long
Dim DataOne As String
Dim DataThree As String
Dim QuestionList As IHTMLElementCollection
Dim Question As IHTMLElement
Dim QuestionFields As IHTMLElementCollection
Dim QuestionField As IHTMLElement
RowNumber = 1
Set QuestionList = html.getElementsByClassName("_Xnb _QJ")
For Each Question In QuestionList
Set QuestionFields = Question.getElementsByTagName("SPAN")
For Each QuestionField In QuestionFields
If QuestionField.className = "_MHb" Then
DataOne= QuestionField.innerText
Cells(RowNumber, 1).Value = DataOne
End If
If QuestionField.className = "_Fs" Then
DataThree = QuestionField.innerText
Cells(RowNumber, 2).Value = DataThree
End If
Next QuestionField
RowNumber = RowNumber + 1
Next
Set html = Nothing
MsgBox "Done!"
End Sub
非常感谢任何帮助。
非常感谢
答案 0 :(得分:0)
我建议你研究XPath - 一种基于标准的查询语言,用于处理XML文档。您也可以将它与HTML文档结合使用。它有点神秘,但非常有用,也可以在VBA中使用。
您的示例HTML看起来有点复杂,因为您有多个具有相同类的<div>
标记。由于//Extracted//
标记中的<img>
,它也不是有效的XML。此外,示例中没有结束标记。无论如何,我已经在下面的代码示例中整理了它。
我看了你的问题,并按照这样解释:
从
的<span>
标记中提取文章_MHb
或Fs
;以及它是<div>
_Xnb _QJ
标记的后代
如果是这样,您的XPath查询可以分为三部分构建:
//div[@class='_Xnb _QJ']
含义 - 获取类_Xnb _QJ
的任何div标签。
(//div[@class='_Xnb _QJ'])[last()]
含义 - 只需从第一组中获取最里面的项目(记住你有多个具有相同类的嵌套<div>
标签)。
(//div[@class='_Xnb _QJ'])[last()]//span[@class='_MHb' or @class='_Fs']
含义 - 为<div>
或<span>
等级的_Mhb
代码过滤最里面的_Fs
。
因此,如果包含MSXML库(我认为您已经完成),则可以在VBA中使用XPath。代码如下所示:
Option Explicit
Sub Test()
Dim strXml As String
Dim objXml As New DOMDocument60
Dim strXPath As String
Dim objXmlNodeList As IXMLDOMNodeList
Dim objXmlNode As IXMLDOMNode
'get the sample XML
strXml = GetXml
'load xml to document
If Not objXml.LoadXML(strXml) Then
Debug.Print "Not parsed"
Exit Sub
End If
'apply XPath
'first just let's get the last <div> tag of class _Xnb _QJ
strXPath = "(//div[@class='_Xnb _QJ'])[last()]"
'test that query
Set objXmlNodeList = objXml.SelectNodes(strXPath)
For Each objXmlNode In objXmlNodeList
Debug.Print objXmlNode.XML
Next objXmlNode
'now lets append a filter to only get the <span> texts
strXPath = strXPath & "//span[@class='_MHb' or @class='_Fs']"
'get output nodes by applying query to xml
Set objXmlNodeList = objXml.SelectNodes(strXPath)
For Each objXmlNode In objXmlNodeList
Debug.Print objXmlNode.Text
Next objXmlNode
End Sub
Function GetXml() As String
Dim strXml As String
strXml = ""
strXml = strXml & "<div class=""results"">"
strXml = strXml & " <div class=""_s2 _wPc"">"
strXml = strXml & " <div class=""_fW _QJ"">"
strXml = strXml & " <div class=""_Xnb _QJ _Z9b"">"
strXml = strXml & " <div class=""_Xnb _QJ"">"
strXml = strXml & " <div class=""_Xnb _QJ"">"
strXml = strXml & " <div class=""_Xnb _QJ"">"
strXml = strXml & " <a href=""//Extracted URL//"">"
strXml = strXml & " <span class=""_fbb"">"
strXml = strXml & " <img id=""uid_3"" />"
strXml = strXml & " </span>"
strXml = strXml & " <span class=""_PHb"">"
strXml = strXml & " <span class=""_MHb"">DATA ONE</span>"
strXml = strXml & " </span>"
strXml = strXml & " <span class=""_B6e"">"
strXml = strXml & " <span class=""_x2"">DATA TWO</span>"
strXml = strXml & " <span class=""_Fs""> DATA THREE </span>"
strXml = strXml & " </span>"
strXml = strXml & " </a>"
strXml = strXml & " </div>"
strXml = strXml & " </div>"
strXml = strXml & " </div>"
strXml = strXml & " </div>"
strXml = strXml & " </div>"
strXml = strXml & " </div>"
strXml = strXml & "</div>"
GetXml = strXml
End Function
调试输出如下所示:
<div class="_Xnb _QJ">
<a href="//Extracted URL//">
<span class="_fbb">
<img id="uid_3"/>
</span>
<span class="_PHb">
<span class="_MHb">DATA ONE</span>
</span>
<span class="_B6e">
<span class="_x2">DATA TWO</span>
<span class="_Fs"> DATA THREE </span>
</span>
</a>
</div>
DATA ONE
DATA THREE
这看起来有点复杂 - 但是一旦你尝试了几次就会没问题。