I have followed many tutorials on scraping JavaScript-rendered pages, but I still can't manage to extract the numbers from this table:
http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html
My latest attempt used this code from a Sentdex tutorial:
{
".properties.on_demand_service_plans_collection": {
"type": "collection",
"value": [
{
"plan_name": "my-plan",
"plan_description": "my-plan",
"account_name": "vault-supplied-value",
"account_access_key": "vault-supplied-value"
},
{
"plan_name": "my-plan-test",
"plan_description": "my-plan-test",
"account_name": "vault-supplied-value",
"account_access_key": "vault-supplied-value"
}
],
"optional": false
}
}
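I seem to have hit a dead end. Everyone says to look for a script tied to text that appears in the page source but then disappears from the Beautiful Soup tag text, but I can't find any script associated with the values in the main table of the page above.
Any suggestions on where I should direct my research?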
Answer 0 (score: 2)
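Note that the table you want to scrape is inside an iframe, so the numbers are not in the HTML of the page you requested. You should request the iframe's URL instead and then scrape the table from that response. The iframe URL was found by simply inspecting the element. Sample code using requests looks like this: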
from bs4 import BeautifulSoup
import requests

# URL of the iframe that holds the table, found by inspecting the element
iframe = "https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWQEqN6Sq2pe6I0o/TehV5qd"
html = requests.get(iframe).text
soup = BeautifulSoup(html, 'html.parser')
# The cells of the column of interest carry the classes "col2 yellowBack"
column = soup.find_all("td", {"class": "col2 yellowBack"})
values = [cell.string for cell in column]
You seem to be interested in the values in that column, so values is the desired output:
>>> values
['56.37', '107.75', 'n.a.', '95.99', 'n.a.', '56.00', '52.32', '234.85', '81.21', '40.72', '76.29', '19.90', 'n.a.', '92.41', '12.83', '62.19', '78.28', '60.51', '4995.58', '92.99', '67.56', '175.24', '58.71', '82.14', '57.75', '46.86', '22.95', '70.06', '150.16', '6793.46', '31.07', '34.31', '50.39']
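The values come back as strings, and missing entries show up as 'n.a.'. If you need numbers, something like the following might work as a minimal sketch. The helper name to_floats and its missing parameter are my own, not part of Beautiful Soup, and I am assuming that 'n.a.' is the only missing-value marker and that any thousands separators would be commas:

```python
from typing import List, Optional


def to_floats(values: List[Optional[str]], missing: str = "n.a.") -> List[Optional[float]]:
    """Convert scraped cell strings to floats, mapping missing entries to None."""
    result: List[Optional[float]] = []
    for value in values:
        # Tag.string is None when a cell has more than one child node,
        # so treat None the same way as the missing marker.
        if value is None or value.strip() == missing:
            result.append(None)
        else:
            # Assumption: thousands separators, if any, are commas.
            result.append(float(value.strip().replace(",", "")))
    return result


# A few entries taken from the output above
sample = ['56.37', '107.75', 'n.a.', '4995.58']
numbers = to_floats(sample)
print(numbers)  # [56.37, 107.75, None, 4995.58]
```

Keeping None for the missing entries preserves the row positions, so the list still lines up with the rest of the table.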