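Extracting span values by class with VBA

Date: 2016-06-13 09:34:41

Tags: html excel vba excel-vba

After a lot of searching, I am still struggling to scrape data from the HTML below using VBA. Specifically, I am trying to extract 'DATA ONE' and 'DATA THREE' from each class="_Xnb _QJ" element in the following HTML: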

<div class="results">
  <div class="_s2 _wPc">
    <div class="_fW _QJ">
    <div class="_Xnb _QJ _Z9b">
    <div class="_Xnb _QJ">
    <div class="_Xnb _QJ">
    <div class="_Xnb _QJ">
      <a href="//Extracted URL//">
        <span class="_fbb">
          <img id="uid_3" //Extracted// >
        </span>
        <span class="_PHb">
          <span class="_MHb">DATA ONE</span>
        </span>
        <span class="_B6e">
          <span class="_x2">DATA TWO</span>
          <span class="_Fs"> DATA THREE </span>

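I have been trying to use getElementsByClassName to get a collection of the "_Xnb _QJ" elements, and then, for each one, use getElementsByTagName to search for "_MHb" and "_Fs". I can't pick the children by position, because the order varies between the "_Xnb .." elements, but the data I need always carries the same class (_MHb / _Fs).

I'm new to VBA / HTML, so this code was mostly assembled by editing examples from elsewhere on Stack Overflow. I'm wondering whether the reason I can't extract the right data is that the classes I need sit under the "href" rather than directly under the _Xnb class?

The relevant part of my VBA code is below. When I run it, the code seems to run fine, but no data is collected.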

Dim RowNumber As Long
Dim DataOne As String
Dim DataThree As String
Dim QuestionList As IHTMLElementCollection
Dim Question As IHTMLElement
Dim QuestionFields As IHTMLElementCollection
Dim QuestionField As IHTMLElement
RowNumber = 1

Set QuestionList = html.getElementsByClassName("_Xnb _QJ")

For Each Question In QuestionList
    Set QuestionFields = Question.getElementsByTagName("SPAN")

    For Each QuestionField In QuestionFields
        If QuestionField.className = "_MHb" Then
            DataOne = QuestionField.innerText
            Cells(RowNumber, 1).Value = DataOne
        End If

        If QuestionField.className = "_Fs" Then
            DataThree = QuestionField.innerText
            Cells(RowNumber, 2).Value = DataThree
        End If

    Next QuestionField
    RowNumber = RowNumber + 1
Next
Set html = Nothing
MsgBox "Done!"

End Sub

Any help is greatly appreciated.

Many thanks

1 answer:

Answer 0 (score: 0)

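I suggest you look into XPath, a standards-based query language for working with XML documents. You can also use it with HTML documents, provided the markup can be parsed as XML. It is a bit cryptic, but very useful, and it is available in VBA.

Your sample HTML looks a bit complicated because you have multiple <div> tags with the same class. It is also not valid XML because of the //Extracted// in the <img> tag. In addition, the sample has no closing tags. Anyway, I have tidied it up in the code sample below.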

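I read your question and interpreted it like this:

  Extract the text from <span> tags with class _MHb or _Fs, where the span is a descendant of the <div> tag with class _Xnb _QJ.

If so, your XPath query can be built up in three parts: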

//div[@class='_Xnb _QJ']

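Meaning: get any div tag with class _Xnb _QJ. The comparison is an exact string match on the class attribute, so the div with class _Xnb _QJ _Z9b is not selected.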

(//div[@class='_Xnb _QJ'])[last()]

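Meaning: take just the innermost item from that first set (remember you have multiple nested <div> tags with the same class). [last()] picks the last match in document order, which in this sample is the innermost div.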

(//div[@class='_Xnb _QJ'])[last()]//span[@class='_MHb' or @class='_Fs']

Meaning: within that innermost <div>, keep only the <span> tags with class _MHb or _Fs.

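So, if you have added a reference to the MSXML library (which I think you have already done), you can use XPath in VBA. The DOMDocument60 type used here comes from Microsoft XML, v6.0. The code looks like this: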

Option Explicit

Sub Test()

    Dim strXml As String
    Dim objXml As New DOMDocument60
    Dim strXPath As String
    Dim objXmlNodeList As IXMLDOMNodeList
    Dim objXmlNode As IXMLDOMNode

    'get the sample XML
    strXml = GetXml

    'load xml to document
    If Not objXml.LoadXML(strXml) Then
        Debug.Print "Not parsed"
        Exit Sub
    End If

    'apply XPath
    'first just let's get the last <div> tag of class _Xnb _QJ
    strXPath = "(//div[@class='_Xnb _QJ'])[last()]"
    'test that query
    Set objXmlNodeList = objXml.SelectNodes(strXPath)
    For Each objXmlNode In objXmlNodeList
        Debug.Print objXmlNode.XML
    Next objXmlNode

    'now let's append a filter to only get the <span> texts
    strXPath = strXPath & "//span[@class='_MHb' or @class='_Fs']"

    'get output nodes by applying query to xml
    Set objXmlNodeList = objXml.SelectNodes(strXPath)
    For Each objXmlNode In objXmlNodeList
        Debug.Print objXmlNode.Text
    Next objXmlNode

End Sub

Function GetXml() As String

    Dim strXml As String

    strXml = ""
    strXml = strXml & "<div class=""results"">"
    strXml = strXml & "  <div class=""_s2 _wPc"">"
    strXml = strXml & "    <div class=""_fW _QJ"">"
    strXml = strXml & "      <div class=""_Xnb _QJ _Z9b"">"
    strXml = strXml & "        <div class=""_Xnb _QJ"">"
    strXml = strXml & "          <div class=""_Xnb _QJ"">"
    strXml = strXml & "            <div class=""_Xnb _QJ"">"
    strXml = strXml & "              <a href=""//Extracted URL//"">"
    strXml = strXml & "                <span class=""_fbb"">"
    strXml = strXml & "                  <img id=""uid_3"" />"
    strXml = strXml & "                </span>"
    strXml = strXml & "                <span class=""_PHb"">"
    strXml = strXml & "                  <span class=""_MHb"">DATA ONE</span>"
    strXml = strXml & "                </span>"
    strXml = strXml & "                <span class=""_B6e"">"
    strXml = strXml & "                  <span class=""_x2"">DATA TWO</span>"
    strXml = strXml & "                  <span class=""_Fs""> DATA THREE </span>"
    strXml = strXml & "                </span>"
    strXml = strXml & "              </a>"
    strXml = strXml & "            </div>"
    strXml = strXml & "          </div>"
    strXml = strXml & "        </div>"
    strXml = strXml & "      </div>"
    strXml = strXml & "    </div>"
    strXml = strXml & "  </div>"
    strXml = strXml & "</div>"

    GetXml = strXml

End Function

The debug output looks like this:

<div class="_Xnb _QJ">
    <a href="//Extracted URL//">
        <span class="_fbb">
            <img id="uid_3"/>
        </span>
        <span class="_PHb">
            <span class="_MHb">DATA ONE</span>
        </span>
        <span class="_B6e">
            <span class="_x2">DATA TWO</span>
            <span class="_Fs"> DATA THREE </span>
        </span>
    </a>
</div>
DATA ONE
DATA THREE
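
If you want the values in worksheet cells, one row per result, rather than in the Immediate window, the same idea can be sketched roughly as below. This is only an illustration under some assumptions: WriteSpansToSheet and lngRow are names I made up, it reuses the GetXml function above, it writes to the active sheet, and it assumes that each result is a <div> whose class is exactly _Xnb _QJ and which has an <a> element as a direct child (that is what the [a] predicate checks). Your real page may need a different test.

```vb
Sub WriteSpansToSheet()

    Dim objXml As New DOMDocument60
    Dim objDiv As IXMLDOMNode
    Dim objSpan As IXMLDOMNode
    Dim lngRow As Long

    'reuse the tidied sample markup from GetXml
    If Not objXml.LoadXML(GetXml) Then
        Debug.Print "Not parsed: " & objXml.parseError.reason
        Exit Sub
    End If

    lngRow = 1

    'assumption: a result block is a <div class="_Xnb _QJ"> with an <a> child
    For Each objDiv In objXml.SelectNodes("//div[@class='_Xnb _QJ'][a]")

        'the leading dot makes the search relative to the current <div>
        Set objSpan = objDiv.SelectSingleNode(".//span[@class='_MHb']")
        If Not objSpan Is Nothing Then
            ActiveSheet.Cells(lngRow, 1).Value = Trim$(objSpan.Text)
        End If

        Set objSpan = objDiv.SelectSingleNode(".//span[@class='_Fs']")
        If Not objSpan Is Nothing Then
            ActiveSheet.Cells(lngRow, 2).Value = Trim$(objSpan.Text)
        End If

        'one worksheet row per result block
        lngRow = lngRow + 1
    Next objDiv

End Sub
```

With the sample markup this should put DATA ONE in A1 and DATA THREE in B1. Looking up each span by its class inside the current <div> means the position of the spans does not matter, and a missing span simply leaves its cell empty.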

This looks a bit complicated, but it gets easier once you have tried it a few times.