How to scrape invisible HTML?

Time: 2018-04-06 18:44:26

Tags: web-scraping

2 answers:

Answer 0 (score: 1)

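Technically they are not invisible; the values you are looking for are simply not in the initial HTML document you requested. For more explanation, read "How do you scrape AJAX pages?"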
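To make that point concrete, here is a minimal Python sketch that is not part of the original answer. The HTML, the JSON payload, and every name in it (`INITIAL_HTML`, `XHR_RESPONSE`, `rows`, `price`) are made up for illustration. It only shows that a value can be absent from the initial document yet present in the response a script fetches afterwards.

```python
import json
from html.parser import HTMLParser

# Hypothetical initial HTML: the table body stays empty until a script fills it.
INITIAL_HTML = """
<html><body>
  <table id="data"><tbody></tbody></table>
  <script src="/load-data.js"></script>
</body></html>
"""

# Hypothetical body of the XHR response that the script would request later.
XHR_RESPONSE = '{"rows": [{"name": "example", "price": "100.0"}]}'


class CellCollector(HTMLParser):
    """Collects the non-empty text of every <td> cell in a document."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())


def cells_in_html(html):
    """Return the table cell values found in a static HTML string."""
    parser = CellCollector()
    parser.feed(html)
    return parser.cells


def rows_in_xhr(body):
    """Return the records carried by the (hypothetical) JSON payload."""
    return json.loads(body)["rows"]


if __name__ == "__main__":
    print(cells_in_html(INITIAL_HTML))  # what a plain HTML scraper sees
    print(rows_in_xhr(XHR_RESPONSE))    # what the XHR response carries
```

A scraper that only downloads the initial HTML finds no cells at all, so the practical fix is to request the data endpoint directly rather than the page.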

Answer 1 (score: 1)

Take a look at the example below. Import the JSON.bas module into the VBA project for JSON processing.

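The scraping is based on parsing the XHR response from the URL http://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/5081/FUT?tradeDate=04/06/2018&strategy=DEFAULT&pageSize=500. You can find such requests in the Network tab of the browser's developer tools (e.g. Chrome) after the page has loaded.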

Option Explicit

Sub Test()

    Dim sJSONString As String
    Dim vJSON
    Dim sState As String
    Dim aData()
    Dim aHeader()

    ' Retrieve the JSON with a synchronous GET request
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/5081/FUT?tradeDate=04/06/2018&strategy=DEFAULT&pageSize=500", False
        .send
        sJSONString = .responseText
    End With
    ' Parse the JSON and convert the "settlements" array to a 2D array with a header
    JSON.Parse sJSONString, vJSON, sState
    vJSON = vJSON("settlements")
    JSON.ToArray vJSON, aData, aHeader
    ' Write the header and the data to the first worksheet
    With Sheets(1)
        .Cells.Delete
        .Cells.WrapText = False
        OutputArray .Cells(1, 1), aHeader
        Output2DArray .Cells(2, 1), aData
        .Columns.AutoFit
    End With

End Sub

Sub OutputArray(oDstRng As Range, aCells As Variant)

    ' Write a 1D array as a single row of text cells
    With oDstRng
        .Parent.Select
        With .Resize(1, UBound(aCells) - LBound(aCells) + 1)
            .NumberFormat = "@"
            .Value = aCells
        End With
    End With

End Sub

Sub Output2DArray(oDstRng As Range, aCells As Variant)

    ' Write a 2D array as a block of text cells
    With oDstRng
        .Parent.Select
        With .Resize( _
                UBound(aCells, 1) - LBound(aCells, 1) + 1, _
                UBound(aCells, 2) - LBound(aCells, 2) + 1)
            .NumberFormat = "@"
            .Value = aCells
        End With
    End With

End Sub

The output of the above code for me is as follows:

output
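For readers working outside VBA, the request-and-flatten steps of this answer can be roughly sketched in Python. This is not part of the original answer. The function names (`build_url`, `fetch_settlements`, `to_table`) are hypothetical, and `to_table` only approximates what `JSON.ToArray` is used for here; it is not a port of it. The base URL and the `settlements` key come from the answer, the meaning of `5081` and `FUT` is not stated there, and the endpoint may no longer respond.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL copied from the answer; what "5081" and "FUT" denote is not stated there.
BASE_URL = "http://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/5081/FUT"


def build_url(trade_date, strategy="DEFAULT", page_size=500):
    """Rebuild the request URL from its three query parameters."""
    query = urlencode(
        {"tradeDate": trade_date, "strategy": strategy, "pageSize": page_size},
        safe="/",  # leave the slashes in the date unescaped, as in the answer's URL
    )
    return f"{BASE_URL}?{query}"


def fetch_settlements(url):
    """Download the JSON and return its "settlements" list (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)["settlements"]


def to_table(records):
    """Flatten a list of dicts into a header row plus a 2D list of cell values."""
    header = []
    for record in records:
        for key in record:
            if key not in header:
                header.append(key)
    # Keys missing from a record become empty cells.
    rows = [[record.get(key, "") for key in header] for record in records]
    return header, rows
```

Calling `to_table(fetch_settlements(build_url("04/06/2018")))` would give a header and rows comparable to what the VBA writes to the sheet, assuming the endpoint still returns the same JSON shape.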

By the way, a similar approach is applied in the following answers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14.