Scraping a table from a website using VBA

Date: 2019-02-20 19:28:04

Tags: html excel vba internet-explorer web-scraping

I am new to both VBA and websites.

I am trying to extract data (a table) from the website below so that I can use it in VBA code.

http://www.bkam.ma/Marches/Principaux-indicateurs/Marche-obligataire/Marche-des-bons-de-tresor/Marche-secondaire/Taux-de-reference-des-bons-du-tresor?date=13%2F02%2F2019&block=e1d6b9bbf87f86f8ba53e8518e882982#address-c3367fcefc5f524397748201aee5dab8-e1d6b9bbf87f86f8ba53e8518e882982

I tried creating an Internet Explorer instance:

Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")

With appIE
    .Navigate "http://www.bkam.ma/Marches/Principaux-indicateurs/Marche-obligataire/Marche-des-bons-de-tresor/Marche-secondaire/Taux-de-reference-des-bons-du-tresor?date=13%2F02%2F2019&block=e1d6b9bbf87f86f8ba53e8518e882982#address-c3367fcefc5f524397748201aee5dab8-e1d6b9bbf87f86f8ba53e8518e882982"
    .Visible = True
End With

Do While appIE.Busy
    DoEvents
Loop

Then I tried to get the data by ID or by tag name:

Set val = appIE.document.getElementById()

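I don't know how to get at the elements in the table, because they have no ID or tag name that I can use, as you can see in this excerpt from the page source: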

                </span>
            </div>
        </th>
    </tr>
</thead>
<tbody>
    <tr>
        <td>18/03/2019</td>
        <td><span class="number">20,05</sapn>&nbsp;<span class="symbol"></span></td>
        <td><span class="number">2,250</sapn>&nbsp;<span class="symbol">%</span></td>
        <td>13/02/2019</td>
    </tr>

This snippet shows the first row of the table I want to extract.

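2 answers:

Answer 0: (score: 1)

You can avoid using a browser altogether. Fetch the page content with XMLHTTP, select the table element by its class (there is no id to use, and class is the next-fastest selector after id), then loop over the rows and columns and write them out to the sheet.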

Option Explicit

Public Sub GetTable()
    ' Requires a reference: VBE > Tools > References > Microsoft HTML Object Library
    Dim html As MSHTML.HTMLDocument, hTable As Object, ws As Worksheet
    Dim td As Object, tr As Object, th As Object, r As Long, c As Long

    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set html = New MSHTML.HTMLDocument

    ' Fetch the page synchronously and load the response into the HTML document
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://www.bkam.ma/Marches/Principaux-indicateurs/Marche-obligataire/Marche-des-bons-de-tresor/Marche-secondaire/Taux-de-reference-des-bons-du-tresor?date=13%2F02%2F2019&block=e1d6b9bbf87f86f8ba53e8518e882982#address-c3367fcefc5f524397748201aee5dab8-e1d6b9bbf87f86f8ba53e8518e882982", False
        .send
        html.body.innerHTML = .responseText
    End With

    ' Select the table by its class name
    Set hTable = html.querySelector(".dynamic_contents_ref_12")
    If hTable Is Nothing Then Exit Sub   ' table not found in the response

    ' Write header cells (th) and data cells (td) to the sheet, one HTML row per sheet row
    For Each tr In hTable.getElementsByTagName("tr")
        r = r + 1: c = 1
        For Each th In tr.getElementsByTagName("th")
            ws.Cells(r, c) = th.innerText
            c = c + 1
        Next
        For Each td In tr.getElementsByTagName("td")
            ws.Cells(r, c) = td.innerText
            c = c + 1
        Next
    Next
End Sub
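
If the request fails, the response text will not contain the table. A minimal sketch of a guard for that case follows; FetchHtml is a hypothetical helper name introduced here for illustration and is not part of the answer's code. It assumes that any status other than 200 should be treated as a failure.

```vba
Option Explicit

' Hypothetical helper (an assumption, not from the answer above):
' returns the response body, or an empty string when the server
' does not answer with HTTP status 200.
Public Function FetchHtml(ByVal url As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False          ' False = synchronous request
        .send
        If .Status = 200 Then FetchHtml = .responseText
    End With
End Function
```

The caller could then test for an empty string (for example with Len(...) = 0) before assigning the result to html.body.innerHTML.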

Answer 1: (score: 0)

First, you can find the table by its class attribute:

Set HTMLTable = appIE.document.getElementsByClassName("dynamic_contents_ref_12")(0)

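This gets the collection of HTML elements whose class name is dynamic_contents_ref_12 and returns the first element in it.

You can then "crawl" the table using the .Children property.

This gives you the first row: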

Set TBody = HTMLTable.Children(1) 'The <tbody> tag is the second child
Set Row1 = TBody.Children(0)      'The first <tr> inside the <tbody> tag

For each of the other rows, put a different index in the parentheses.
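
As a rough sketch of what "a different index" means in practice, the rows can be visited by index in a loop. PrintRows is a hypothetical name used only for illustration, and the sketch assumes it is passed the <tbody> element obtained as shown above.

```vba
Option Explicit

' Hypothetical helper (an assumption for illustration): prints the text
' of every row in the given <tbody> element to the Immediate window.
Public Sub PrintRows(ByVal TBody As Object)
    Dim i As Long
    ' Children is zero-based, so the last valid index is Length - 1
    For i = 0 To TBody.Children.Length - 1
        Debug.Print i, TBody.Children(i).innerText
    Next i
End Sub
```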

The HTML in Row1 now looks like this:

<tr>

  <td>
    18/03/2019
  </td>

  <td>
    <span class="number">
      20,05&nbsp;
      <span class="symbol"></span>
    </span>
  </td>

  <td>
    <span class="number">
      2,250&nbsp;
      <span class="symbol">%</span>
    </span>
  </td>

  <td>
    13/02/2019
  </td>

</tr>

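(Each <td> is one cell in the row.)

To get the text in a cell, we can use the .innerText property, which returns a string:

CellA1 = Row1.Children(0).innerText ' = "18/03/2019" for the row shown above
CellB1 = Row1.Children(1).innerText ' = "20,05" followed by a non-breaking space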
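
Because .innerText returns a string, values such as "20,05" or "2,250 %" may need cleaning before they can be used as numbers. The following is a minimal sketch under that assumption; CellTextToNumber is a hypothetical helper name, and it assumes the page always uses a decimal comma.

```vba
Option Explicit

' Hypothetical helper (an assumption for illustration): converts cell text
' such as "2,250 %" into a Double.
Public Function CellTextToNumber(ByVal cellText As String) As Double
    Dim cleaned As String
    cleaned = Replace(cellText, ChrW$(160), "")   ' drop non-breaking spaces
    cleaned = Replace(cleaned, "%", "")           ' drop the percent symbol
    cleaned = Replace(Trim$(cleaned), ",", ".")   ' decimal comma -> decimal point
    ' Val always treats "." as the decimal separator, regardless of locale
    CellTextToNumber = Val(cleaned)
End Function
```

For example, CellTextToNumber("20,05" & ChrW$(160)) evaluates to the Double 20.05. Val is used here instead of CDbl because CDbl depends on the regional settings of the machine.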

Putting it all together

Using a For Each loop, we can read every cell in the HTML table and copy it to the worksheet, assuming you want to start at cell A1.

Dim HTMLTable As Object, TBody As Object, Row As Object
Dim RowIndex As Long

'Table headers
ActiveSheet.Range("A1").Value = "Date d'échéance"
ActiveSheet.Range("B1").Value = "Transaction"
ActiveSheet.Range("C1").Value = "Taux moyen pondéré"
ActiveSheet.Range("D1").Value = "Date de la valeur"

'Find the table by class name, then take its <tbody> (the second child)
Set HTMLTable = appIE.document.getElementsByClassName("dynamic_contents_ref_12")(0)
Set TBody = HTMLTable.Children(1)

'Copy each HTML row to the sheet, starting at row 2 (below the headers)
RowIndex = 2
For Each Row In TBody.Children
  ActiveSheet.Cells(RowIndex, 1).Value = Row.Children(0).innerText
  ActiveSheet.Cells(RowIndex, 2).Value = Row.Children(1).innerText
  ActiveSheet.Cells(RowIndex, 3).Value = Row.Children(2).innerText
  ActiveSheet.Cells(RowIndex, 4).Value = Row.Children(3).innerText
  RowIndex = RowIndex + 1
Next
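
The code above reads appIE.document, so the page should have finished loading first. A minimal sketch of a wait routine follows, assuming appIE is the Internet Explorer object created in the question; WaitForPage is a hypothetical name introduced here.

```vba
Option Explicit

' Hypothetical helper (an assumption for illustration): blocks until the
' given InternetExplorer.Application object has finished loading the page.
Public Sub WaitForPage(ByVal ie As Object)
    Const READYSTATE_COMPLETE As Long = 4
    ' Busy alone can turn False before the document is complete,
    ' so the ready state is checked as well
    Do While ie.Busy Or ie.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop
End Sub
```

Calling WaitForPage appIE right after .Navigate, and before touching appIE.document, is one way to use it. Content that the page adds later through scripts may still need an extra wait.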