Correct syntax in HTML scraping

Time: 2018-04-20 08:24:02

Tags: excel vba excel-vba

I have HTML that changes dynamically:

<tbody>
' ------------------- Block 1 ----------------------
   <tr class="table-row">
      <td class="cell">
         <div>18/4/2018</div>
      </td>
      <td class="cell">
         <div>
            <form id="idc" method="post" action=""> ' id is dynamic so I can't use it
               <div style=""><input type="hidden" name="idc_hf_0" id="idc_hf_0" /></div> ' id and name are dynamic so I can't use them
               Download all invoice documents as ZIP-file
               <span>
               <a class="icon zipdownload" title="Download all invoice documents as ZIP-file" href=""></a>
               </span>
               <span class="has-explanation">
               <a class="helper" href="javascript:;" title="The zip-file contains only PDF files of Tax/Fee statements and the Fleet Invoice with all annexes if available.">
               <span class="icon question" id="table-header-explanation"></span>
               </a>
               </span>
            </form>
         </div>
      </td>
      <td class="cell">
         <div>
            <a class="" title="View &gt;&gt;" href="">View &gt;&gt;</a>
         </div>
      </td>
   </tr>
 ' ################### Block1 END #######################

 ' ------------------- Block 2 ----------------------
   <tr class="table-row">
      <td class="cell">
         <div>13/4/2018</div> ' need this
      </td>
      <td class="cell">
         <div>
            <form id="idd" method="post" action="">
               <div style=""><input type="hidden" name="idd_hf_0" id="idd_hf_0" /></div>
               <div>
                  <span>Collective Payment Order</span> (<span>2018-500421707</span>)
                  <span>
                  <span class="invisible"> | </span><span>
                  <a class="Download" title="Download" href="">English</a>
                  </span>
                  </span>
               </div>
               <div>
                  <span>Tax/Fee CSV list</span> <span>
                  <a class="icon csv" title="Download" href=""></a>  ' need this  HREF1
                  </span>
               </div>
               <div>
                  <span>Detailed Trip CSV list</span> <span>
                  <a class="icon csv" title="Download" href=""></a> ' need this HREF2
                  </span>
               </div>
               Download all invoice documents as ZIP-file
               <span>
               <a class="icon zipdownload" title="Download all invoice documents as ZIP-file" href=""></a>
               </span>
               <span class="has-explanation">
               <a class="helper" href="javascript:;" title="The zip-file contains only PDF files of Tax/Fee statements and the Fleet Invoice with all annexes if available.">
               <span class="icon question" id="table-header-explanation"></span>
               </a>
               </span>
            </form>
         </div>
      </td>
      <td class="cell">
         <div>
            <a class="" title="View &gt;&gt;" href="">View &gt;&gt;</a>
         </div>
      </td>
   </tr>
  ' ################### Block2 END #######################

</tbody>

So there are two kinds of blocks, and they appear dynamically. The structure can look like this:

Block1
Block1
Block2
Block1
Block2
Block2
Block2
Block1

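I need to get the following from these blocks:

  1. The number of Block2 rows
  2. The date of each Block2
  3. HREF1 from class="icon csv"
  4. HREF2 from class="icon csv"

Block1 differs from Block2 in that Block1 has no class="icon csv" links (the <span>Tax/Fee CSV list</span> part is missing).

I am confused about how to use the getElement... methods. This is what I tried: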

    Set IeDoc = IeApp.Document
        With IeDoc
            Set IeTbody = .getElementsByTagName("tbody").getElementsByClassName("table-row")
            d = IeTbody.legth
            For Each stEl In IeTbody
    
            Next stEl
    
        End With
    

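But I get the error "Object doesn't support this property or method". Would querySelector be a better choice? How do I get the links?

Logically it should be something like: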

    Set IeDoc = IeApp.Document
        With IeDoc
            Set Blocks = .getElementsByTagName("tbody")
    
        For Each block In Blocks
            Set hasClass = .getElementsByClassName("table-row").getElementsByClassName("cell")(1).getElementsByClassName("icon csv")
            if not hasClass is nothing then
                b.Date = Blocks(block).getElementsByClassName("table-row").getElementsByClassName("cell")(0).getElementsByTagName("div")(0).innerText()
                b.Href1 = Blocks(block).getElementsByClassName("table-row").getElementsByClassName("cell")(1).getElementsByClassName("icon csv")(0)
                b.Href2 = Blocks(block).getElementsByClassName("table-row").getElementsByClassName("cell")(1).getElementsByClassName("icon csv")(1)
            end if
        Next block
    
    End With
    

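1 answer:

Answer 0 (score: 1)

This is not very robust, but it uses a regex to parse the HTML you provided. A lookbehind would help with the regex split, but I can't work that out at the moment. For now I have adapted a regex split function by @FlorentB: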
Public Matches As Object
' Or add Tools > References > Microsoft VBScript Regular Expressions 5.5 for early binding
Public Sub testing()
    Dim str As String, countOfBlock2 As Long, arr() As String, i As Long
    str = Range("A1") 'I am reading in from sheet but this would be your response text
    arr = SplitRe(str, "\<div>[\d]+[\/-][\d]+[\/-][\d]+\<\/div>") 'look behind would help

    ' arr(i) is the text that follows the date held in Matches(i - 1)
    For i = LBound(arr) To UBound(arr)

        If InStr(1, arr(i), "class=""icon csv""") > 0 Then
            countOfBlock2 = countOfBlock2 + 1 ' "Block 2"
            Debug.Print Replace(Replace(Matches(i - 1), "<div>", ""), "</div>", "") 'dates from Block 2
            Debug.Print Split(Split(arr(i), """icon csv"" title=""Download"" href=")(1), "></a>")(0) 'HREF1
            Debug.Print Split(Split(arr(i), """icon csv"" title=""Download"" href=")(2), "></a>")(0) 'HREF2
        End If

    Next i

    Debug.Print "count of block2 = " & countOfBlock2

End Sub

' Adapted from https://stackoverflow.com/questions/28107005/splitting-string-in-vba-using-regex?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
Public Function SplitRe(Text As String, Pattern As String, Optional IgnoreCase As Boolean) As String()
    Static re As Object

    If re Is Nothing Then
        Set re = CreateObject("VBScript.RegExp")
        re.Global = True
        re.MultiLine = True
    End If

    re.IgnoreCase = IgnoreCase
    re.Pattern = Pattern
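    ' Replace every match with a sentinel character, then split on that sentinel
    SplitRe = Strings.Split(re.Replace(Text, ChrW(-1)), ChrW(-1))

    ' Keep the matches themselves (the date divs) for the caller
    Set Matches = re.Execute(Text)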

End Function
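
The indexing in testing (why Matches(i - 1) belongs to arr(i)) may be easier to follow on a tiny input. The following is only a minimal sketch: the sample string and the name DemoDateSplit are made up for illustration, and the capture group around the date is a small variation on the pattern above rather than part of the original function. It is self-contained and does not call SplitRe.

```vba
Public Sub DemoDateSplit()
    ' Hypothetical sample: two date divs, each followed by some text
    Const SAMPLE As String = "x<div>18/4/2018</div>first<div>13/4/2018</div>second"
    Dim re As Object, found As Object
    Dim parts() As String, i As Long

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    ' The parentheses capture the date without the surrounding <div> tags
    re.Pattern = "<div>(\d+[\/-]\d+[\/-]\d+)<\/div>"

    ' Collect the matches, then split the text where the matches were
    Set found = re.Execute(SAMPLE)
    parts = Split(re.Replace(SAMPLE, ChrW(-1)), ChrW(-1))

    ' parts(0) is the text before the first date, so pairing starts at index 1
    For i = 1 To UBound(parts)
        Debug.Print found(i - 1).SubMatches(0) & " -> " & parts(i)
    Next i
    ' Prints:
    ' 18/4/2018 -> first
    ' 13/4/2018 -> second
End Sub
```

Using SubMatches(0) would also remove the need for the nested Replace calls that strip the div tags from each date.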

Output: [screenshot]
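
As an alternative to splitting the HTML text, the DOM can be walked directly. The error in the question comes from calling getElementsByClassName on the result of getElementsByTagName("tbody"): that result is a collection, and the method exists only on the document and on single elements. The following is a sketch under assumptions, not a tested solution for the real site: it assumes the page runs in a document mode where getElementsByClassName is available (IE9 or later), and the names ListBlock2Rows and ieDoc are hypothetical.

```vba
' Sketch only. ieDoc is assumed to be IeApp.Document of a fully loaded page.
Public Sub ListBlock2Rows(ByVal ieDoc As Object)
    Dim rows As Object, csvLinks As Object
    Dim i As Long, countOfBlock2 As Long

    ' Every block in the sample HTML is one <tr class="table-row">
    Set rows = ieDoc.getElementsByClassName("table-row")

    For i = 0 To rows.Length - 1
        ' Call the method on a single row element, not on a collection
        Set csvLinks = rows.Item(i).getElementsByClassName("icon csv")

        ' In the sample HTML only Block 2 rows contain the two CSV links
        If csvLinks.Length >= 2 Then
            countOfBlock2 = countOfBlock2 + 1
            ' The first <div> inside the row holds the date
            Debug.Print rows.Item(i).getElementsByTagName("div").Item(0).innerText
            Debug.Print csvLinks.Item(0).href   ' HREF1 (Tax/Fee CSV list)
            Debug.Print csvLinks.Item(1).href   ' HREF2 (Detailed Trip CSV list)
        End If
    Next i

    Debug.Print "count of block2 = " & countOfBlock2
End Sub
```

It would be called as ListBlock2Rows IeApp.Document once the page has finished loading. Relying on the class names rather than on the dynamic id and name attributes keeps the lookup independent of the values that change between page loads.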