Scraping dynamic text

Time: 2017-08-02 23:29:10

Tags: javascript python-3.x beautifulsoup

I have followed many tutorials on JavaScript scraping, but I still can't manage to extract the numbers from this table:

http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html

My most recent attempt used this code from a Sentdex tutorial:

{
  ".properties.on_demand_service_plans_collection": {
    "type": "collection",
    "value": [
      {
        "plan_name": "my-plan",
        "plan_description": "my-plan",
        "account_name": "vault-supplied-value",
        "account_access_key": "vault-supplied-value"
      },
      {
        "plan_name": "my-plan-test",
        "plan_description": "my-plan-test",
        "account_name": "vault-supplied-value",
        "account_access_key": "vault-supplied-value"
      }
    ],
    "optional": false
  }
}

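It looks like I'm off target. Everyone says that text which shows up in the page source but is missing from the tag text returned by Beautiful Soup is tied to a script, but I can't find any script associated with the values in the main table of the page above.

Any suggestions on where I should direct my research?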

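1 Answer:

Answer 0: (score: 2)

Note that the table you want to scrape is inside an iframe, so you should request the iframe itself and then scrape the table from its response. The iframe URL was found by simply inspecting the element. Sample code using requests looks like this: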

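from bs4 import BeautifulSoup
import requests

# URL of the iframe that holds the table (found by inspecting the element)
iframe = "https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWQEqN6Sq2pe6I0o/TehV5qd"
html = requests.get(iframe).text
soup = BeautifulSoup(html, 'html.parser')

# Every <td> whose class attribute is exactly "col2 yellowBack" (the column of interest)
column = soup.find_all("td", {"class": "col2 yellowBack"})
values = [cell.string for cell in column]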

You seem to be interested in the values in that column, so values is the desired output:

>>> values
['56.37', '107.75', 'n.a.', '95.99', 'n.a.', '56.00', '52.32', '234.85', '81.21', '40.72', '76.29', '19.90', 'n.a.', '92.41', '12.83', '62.19', '78.28', '60.51', '4995.58', '92.99', '67.56', '175.24', '58.71', '82.14', '57.75', '46.86', '22.95', '70.06', '150.16', '6793.46', '31.07', '34.31', '50.39']
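
The answer copies the iframe URL by hand and leaves the results as strings. As a rough sketch of two follow-up steps, the code below looks up iframe URLs in the outer page's HTML and converts cells such as '56.37' or 'n.a.' to numbers. It rests on several assumptions: the iframe tag is present in the HTML that requests receives (if a script inserts it, this lookup will find nothing), the names IframeCollector, find_iframe_urls and to_number are hypothetical helpers rather than anything from the answer, and sample_page is an invented stand-in for the real page. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; with BeautifulSoup the same lookup would start from soup.find_all("iframe").

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collect the src attribute of every <iframe> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_iframe_urls(html: str, base_url: str) -> List[str]:
    """Return absolute URLs for all iframes found in html (hypothetical helper)."""
    collector = IframeCollector()
    collector.feed(html)
    # Relative src values are resolved against the URL of the outer page.
    return [urljoin(base_url, src) for src in collector.sources]


def to_number(cell: Optional[str]) -> Optional[float]:
    """Convert a cell like '56.37' to float; 'n.a.', empty or missing cells give None."""
    if cell is None:
        return None
    text = cell.strip().replace(",", "")
    try:
        return float(text)
    except ValueError:
        return None


# Invented stand-in for the outer page; the real markup may differ.
sample_page = '<html><body><iframe src="/embedded/table.html"></iframe></body></html>'
iframe_urls = find_iframe_urls(sample_page, "http://www.example.com/page.html")

# A few strings taken from the output shown in the answer.
numbers = [to_number(v) for v in ["56.37", "107.75", "n.a.", "4995.58"]]
```

In a real run, the html argument would be the text of the outer page fetched with requests.get, and each returned URL could then be requested and parsed as in the answer. to_number accepts None because a BeautifulSoup tag's .string is None when the cell contains more than one child node.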