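I am trying to extract data from a table on a web page. The page displays values rounded to three decimal places, but the actual cell values have four decimal places. I need the full, unrounded numbers.
My loop: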
for i in range(0, 20):
    soup = BeautifulSoup(html_source, 'lxml')
    table = soup.find_all('table')[i]
    df = pd.read_html(str(table))
    print(region, i)
    print(tabulate(df[0], headers='keys', tablefmt='psql'))
The web page element:
<span class="price-data " data-amount="{"regional":
{"asia-pacific-east":0.022,"japan-
east":0.0176,"japan-west":0.0206,"us-
west":0.0164,"us-west-2":0.0144,"us-west-
central":0.018,"west-india":0.0193}}" data-decimals="3"
data-decimals-force="3" data-month-format="{0}/month" data-hour-format="
{0}/hour" data-region-unavailable="N/A" data-has-valid-
price="true">$0.018/hour</span>
My code displays 0.018/hour, but I need it to display 0.0176/hour.
Note: this is for Japan East (the sample data also includes Japan West).
Answer 0 (score: 2)
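Assuming the JSON is well-formed, you can extract it from the data-amount attribute of the <span> as follows:
from bs4 import BeautifulSoup
import html
import json

# Assumption: in the raw page source the quotes inside data-amount are HTML-escaped as &quot;
html_text = """<span class="price-data " data-amount="{&quot;regional&quot;:{&quot;asia-pacific-east&quot;:0.022,&quot;japan-east&quot;:0.0176,&quot;japan-west&quot;:0.0206,&quot;us-west&quot;:0.0164,&quot;us-west-2&quot;:0.0144,&quot;us-west-central&quot;:0.018,&quot;west-india&quot;:0.0193}}" data-decimals="3" data-decimals-force="3" data-month-format="{0}/month" data-hour-format="{0}/hour" data-region-unavailable="N/A" data-has-valid-price="true">$0.018/hour</span>"""
soup = BeautifulSoup(html_text, "html.parser")
# BeautifulSoup already decodes entities in attribute values; unescape is a harmless safeguard
da = html.unescape(soup.span['data-amount'])
data_amount = json.loads(da)
print(data_amount['regional']['japan-east'])
This will display:
0.0176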
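To connect this with the table loop in the question, one option is to rewrite each price span's visible text before the table is handed to pandas. The following is only a sketch under stated assumptions: the function name show_unrounded is hypothetical, every price span is assumed to carry a data-amount attribute with a "regional" object, and the first number in the span text is assumed to be the rounded price.

```python
import json
import re

from bs4 import BeautifulSoup


def show_unrounded(html_source, region):
    """Return html_source with each price span showing the unrounded value for region.

    Hypothetical helper: assumes data-amount holds JSON shaped like
    {"regional": {"<region>": <price>, ...}}.
    """
    soup = BeautifulSoup(html_source, "html.parser")
    for span in soup.select("span.price-data[data-amount]"):
        # Drop whitespace that line wrapping may have introduced into the JSON.
        raw = "".join(span["data-amount"].split())
        try:
            # parse_float=str keeps each number exactly as written in the page.
            prices = json.loads(raw, parse_float=str).get("regional", {})
        except json.JSONDecodeError:
            continue  # leave spans with malformed JSON untouched
        if region not in prices:
            continue  # leave regions without a price untouched
        # Swap only the first number in the visible text, keeping "$" and "/hour".
        span.string = re.sub(
            r"\d+(?:\.\d+)?", str(prices[region]), span.get_text(), count=1
        )
    return str(soup)


sample = (
    '<table><tr><td><span class="price-data " '
    'data-amount="{&quot;regional&quot;:{&quot;japan-east&quot;:0.0176,'
    '&quot;us-west&quot;:0.0164}}" data-decimals="3">$0.018/hour</span>'
    "</td></tr></table>"
)
print(show_unrounded(sample, "japan-east"))
```

The returned HTML could then be passed to the same pd.read_html call used in the question, so the resulting DataFrame would contain $0.0176/hour rather than the rounded text.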
Answer 1 (score: 1)
You can also correct the JSON as shown and use the following:
from bs4 import BeautifulSoup
import json

# Assumption: as in the raw page source, the quotes inside data-amount are escaped as &quot;
html = '''<span class="price-data " data-amount="{&quot;regional&quot;:
{&quot;asia-pacific-east&quot;:0.022,&quot;japan-
east&quot;:0.0176,&quot;japan-west&quot;:0.0206,&quot;us-
west&quot;:0.0164,&quot;us-west-2&quot;:0.0144,&quot;us-west-
central&quot;:0.018,&quot;west-india&quot;:0.0193}}" data-decimals="3"
data-decimals-force="3" data-month-format="{0}/month" data-hour-format="
{0}/hour" data-region-unavailable="N/A" data-has-valid-
price="true">$0.018/hour</span>'''
soup = BeautifulSoup(html, 'lxml')
items = soup.select('span.price-data')
for item in items:
    if item.has_attr('data-amount'):
        # Strip the line breaks and spaces introduced by wrapping so the JSON parses
        val = json.loads(item['data-amount'].replace('\n', ' ').replace(' ', ''))
        print(val['regional']['japan-east'])
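The whitespace cleanup and the region lookup can also be wrapped in a small function that works on the attribute string alone. This is a minimal sketch using only the standard library: the name regional_price is hypothetical, and it assumes that no key or value in the JSON legitimately contains whitespace. Passing parse_float=Decimal to json.loads keeps the digits exactly as they appear in the page rather than converting them to binary floats.

```python
import json
from decimal import Decimal


def regional_price(data_amount, region):
    """Return the unrounded price for region, or None if the region is absent.

    Hypothetical helper: data_amount is the (already unescaped) value of the
    span's data-amount attribute.
    """
    # Remove all whitespace, including line breaks left by wrapped markup.
    cleaned = "".join(data_amount.split())
    # Decimal preserves the literal digits written in the JSON text.
    prices = json.loads(cleaned, parse_float=Decimal)
    return prices.get("regional", {}).get(region)


wrapped = '{"regional":\n{"japan-\neast":0.0176,"us-west-\ncentral":0.018}}'
print(regional_price(wrapped, "japan-east"))
```

A missing region returns None here, which the caller could map to the span's data-region-unavailable text if that behaviour is wanted.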