I can successfully extract data from the website, except for one field, whose value is stored in an img alt attribute. Here is the code:
#import pandas as pd
import re
from urllib2 import urlopen
from bs4 import BeautifulSoup

# gets a file-like object using urllib2.urlopen
url = 'http://ecal.forexpros.com/e_cal.php?duration=daily'
html = urlopen(url)
soup = BeautifulSoup(html)

# loops over all <tr> elements with class 'ec_bg1_tr' or 'ec_bg2_tr'
for tr in soup.find_all('tr', {'class': re.compile('ec_bg[12]_tr')}):
    # finds desired data by looking up <td> elements with class names
    event = tr.find('td', {'class': 'ec_td_event'}).text
    currency = tr.find('td', {'class': 'ec_td_currency'}).text
    actual = tr.find('td', {'class': 'ec_td_actual'}).text
    forecast = tr.find('td', {'class': 'ec_td_forecast'}).text
    previous = tr.find('td', {'class': 'ec_td_previous'}).text
    time = tr.find('td', {'class': 'ec_td_time'}).text
    importance = tr.find('td', {'class': 'ec_td_importance'}).text
    # the returned strings are unicode, so to print them we need a unicode string
    print u'{:3}\t{}\t{:5}\t{:8}\t{:8}\t{:8}\t{}'.format(currency, importance, time, actual, forecast, previous, event)
The first few records of the output look like this:
JPY 01:00 43.8 43.6 43.3 Household Confidence
CHF 01:45 -3 -3 -8 SECO Consumer Climate
RON 02:00 2.50% 3.30% PPI (YoY)
EUR 03:00 -26.9K -66.5K -98.3K Spanish Unemployment Change
CHF 03:15 1.5% 1.3% -0.8% Retail Sales (YoY)
CHF 03:30 60.9 58.9 60.1 SVME PMI
GBP 04:30 51.9 54.5 54.8 Construction PMI
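The importance field does not appear in the output above (probably because the data is contained in the img alt attribute rather than in the cell's text).
Does anyone know how to solve this?
Thanks!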
Edit
I solved the problem by replacing this line:
importance = tr.find('td', {'class': 'ec_td_importance'}).text
with:
importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt')
Answer 0 (score: 1)
Replace your importance line with:
importance = tr.find('td', {'class': 'ec_td_importance'}).img['alt']
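
The two fixes above differ only in how a missing attribute is handled: img['alt'] raises KeyError if the attribute is absent, while img.get('alt') returns None. A minimal sketch of a more defensive lookup follows. It runs against a small hand-written table rather than the live page, so the markup, the image file name, and the helper name importance_of are assumptions for illustration only, and it uses Python 3 syntax rather than the Python 2 of the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating two calendar rows; the real page may differ.
SAMPLE_ROWS = """
<table>
  <tr class="ec_bg1_tr">
    <td class="ec_td_currency">JPY</td>
    <td class="ec_td_importance"><img src="icon.gif" alt="Medium"></td>
  </tr>
  <tr class="ec_bg2_tr">
    <td class="ec_td_currency">CHF</td>
    <td class="ec_td_importance"></td>
  </tr>
</table>
"""

def importance_of(tr):
    """Return the alt text of the importance icon, or '' if there is none."""
    td = tr.find('td', {'class': 'ec_td_importance'})
    # td.img is shorthand for td.find('img'); it is None when no <img> exists
    img = td.img if td is not None else None
    # .get() returns the default instead of raising KeyError when alt is absent
    return img.get('alt', '') if img is not None else ''

soup = BeautifulSoup(SAMPLE_ROWS, 'html.parser')
rows = soup.find_all('tr')
importances = [importance_of(tr) for tr in rows]
print(importances)
```

In this sample the first cell's .text is an empty string because the value lives in the attribute, which matches the blank column in the question's output. The second row shows the helper falling back to '' when a cell has no image.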