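Extracting text from an img alt attribute with Beautiful Soup

Time: 2017-08-02 17:50:59

Tags: python beautifulsoup

I can successfully extract data from the website, except for one field whose value is carried in an img alt attribute. Here is the code: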

#import pandas as pd
import re
from urllib2 import urlopen
from bs4 import BeautifulSoup

# gets a file-like object using urllib2.urlopen
url = 'http://ecal.forexpros.com/e_cal.php?duration=daily'
html = urlopen(url)

soup = BeautifulSoup(html)

# loops over all <tr> elements with class 'ec_bg1_tr' or 'ec_bg2_tr'
for tr in soup.find_all('tr', {'class': re.compile('ec_bg[12]_tr')}):
    # finds desired data by looking up <td> elements with class names
    event = tr.find('td', {'class': 'ec_td_event'}).text
    currency = tr.find('td', {'class': 'ec_td_currency'}).text
    actual = tr.find('td', {'class': 'ec_td_actual'}).text
    forecast = tr.find('td', {'class': 'ec_td_forecast'}).text
    previous = tr.find('td', {'class': 'ec_td_previous'}).text
    time = tr.find('td', {'class': 'ec_td_time'}).text
    importance = tr.find('td', {'class': 'ec_td_importance'}).text

    # the returned strings are unicode, so to print them we need a unicode string
    print u'{:3}\t{}\t{:5}\t{:8}\t{:8}\t{:8}\t{}'.format(currency, importance, time, actual, forecast, previous, event)

The first few records of the output look like this:

JPY     01:00   43.8        43.6        43.3        Household Confidence 
CHF     01:45   -3          -3          -8          SECO Consumer Climate 
RON     02:00   2.50%                   3.30%       PPI (YoY) 
EUR     03:00   -26.9K      -66.5K      -98.3K      Spanish Unemployment Change 
CHF     03:15   1.5%        1.3%        -0.8%       Retail Sales (YoY) 
CHF     03:30   60.9        58.9        60.1        SVME PMI 
GBP     04:30   51.9        54.5        54.8        Construction PMI

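The importance field does not appear in the output above (probably because the data is contained in the img alt attribute rather than in the cell text).

Does anyone know how to fix this?

Thanks!

Edit

Solved the problem by replacing the following: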

importance = tr.find('td', {'class': 'ec_td_importance'}).text

with:

importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt')

1 answer:

Answer 0 (score: 1)

Replace your importance line with:

importance = tr.find('td', {'class': 'ec_td_importance'}).img['alt']
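
The two forms shown above, .img.get('alt') and .img['alt'], behave the same when the attribute is present, but differ when it is missing: subscripting raises KeyError, while .get() returns None (or a default you pass). Both fail with AttributeError if the cell has no img at all, because td.img is then None. Below is a minimal sketch of a guarded lookup. The helper name importance_of, the sample markup, and the alt value "High" are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question's table; the alt value is invented.
SAMPLE_HTML = """
<table>
  <tr class="ec_bg1_tr">
    <td class="ec_td_importance"><img src="a.png" alt="High"></td>
  </tr>
  <tr class="ec_bg2_tr">
    <td class="ec_td_importance"><img src="b.png"></td>
  </tr>
  <tr class="ec_bg1_tr">
    <td class="ec_td_importance"></td>
  </tr>
</table>
"""


def importance_of(tr, default=''):
    """Return the alt text of the <img> in the importance cell, or default."""
    td = tr.find('td', {'class': 'ec_td_importance'})
    # td.img is None when the cell contains no <img>, so check before reading alt.
    if td is None or td.img is None:
        return default
    # .get() falls back to default for a missing attribute;
    # td.img['alt'] would raise KeyError in that case.
    return td.img.get('alt', default)


# An explicit parser avoids the "no parser was explicitly specified" warning.
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
values = [importance_of(tr) for tr in soup.find_all('tr')]
print(values)
```

In the original loop this would amount to writing importance = importance_of(tr), assuming every row should still be printed when the icon is absent.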