Python - Error thrown when splitting data into cells - list index out of range

Time: 2016-05-05 22:43:15

Tags: python mysql beautifulsoup

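I'm using bs4 to scrape earnings dates from a Yahoo table. My code works until I try to split each row into cells. The exact error is:

ticker = cells[1].get_text()
IndexError: list index out of range

I think this is because the table cell contains an 'a href'... but it also contains text.

Ideally, the output format should look like this:

{'company': '2U Inc', 'ticker': 'TWOU', 'eps_est': '-0.04', 'time': 'After Market Close'}

How can I achieve the output above, and what am I missing?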

from urlparse import urljoin
from urllib2 import urlopen
import requests
from bs4 import BeautifulSoup
import MySQLdb
import re

#mysql portion
mydb = MySQLdb.connect(host='localhost',
user= '####',
passwd='#####',
db='testdb')
cur = mydb.cursor()

#def store (company, ticker, eps_est, time):
#    cur.execute('INSERT IGNORE INTO EARN (company, ticker, eps_est, time)  VALUES ( \"%s\", \"%s\", \"%s\", \"%s\")',(company, ticker, eps_est, time))
#    cur.connection.commit()

base_url = "https://biz.yahoo.com/research/earncal/today.html"
html = urlopen(base_url)
soup = BeautifulSoup(html.read().decode('utf-8'),"lxml")
table = soup.find_all('table')
rows = table[6].find_all('tr')

for row in rows[2:]:
    cells = row.find_all('td')
    company = cells[0].get_text()
    ticker =  cells[1].get_text()
    eps_est = cells[2].get_text()
    time =    cells[3].get_text()
    #    store(company, ticker, eps_est, time)
data = {
    'company': cells[0].get_text(),
    'ticker': cells[1].get_link('href'),
    'eps_est': cells[2].get_text(),
    'time': cells[3].get_text(),
}
print data
print '\n'

1 Answer:

Answer 0 (score: 1)

Use "dot notation" to find an element nested inside another element. Replace:

cells[1].get_link('href')

with:

cells[1].a.get_text()

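which should be read as equivalent to cells[1].find("a").get_text(). A Tag has no get_link method; bs4 treats an unknown attribute name as a search for a child tag with that name, so cells[1].get_link evaluates to None.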
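To make the equivalence concrete, here is a minimal sketch on a hand-written one-row table. The HTML string and the href value are made up for illustration and only approximate the shape of the Yahoo page; "html.parser" is used instead of "lxml" so nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table standing in for a row of the earnings page.
html = """
<table>
  <tr>
    <td>2U Inc</td>
    <td><a href="/q?s=twou">TWOU</a></td>
    <td>-0.04</td>
    <td>After Market Close</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find("tr").find_all("td")

# Dot notation: cells[1].a is the first <a> tag inside the second cell.
via_dot = cells[1].a.get_text()

# The explicit form of the same lookup.
via_find = cells[1].find("a").get_text()

# If the link target itself is wanted, index the tag like a dictionary.
href = cells[1].a["href"]

print(via_dot)
print(href)
```

Both lookups return the ticker text here. If a cell might not contain a link at all, cells[1].a is None, so it is worth checking before calling get_text() on it.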

Also, you need to skip the last "empty" row, which has fewer than four cells and is what triggers the IndexError:

for row in rows[2:-1]:
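As an alternative to slicing off the last row, the loop can skip any row that does not have the expected number of cells, which also copes with empty rows elsewhere in the table. The sketch below runs on a made-up table (two header rows, one data row, one trailing empty row) that only assumes the general layout described in the question; the name records is introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table: two header rows, one data row, one trailing "empty" row.
html = """
<table>
  <tr><td colspan="4">Earnings Announcements</td></tr>
  <tr><td>Company</td><td>Symbol</td><td>EPS Estimate</td><td>Time</td></tr>
  <tr><td>2U Inc</td><td><a href="/q?s=twou">TWOU</a></td><td>-0.04</td><td>After Market Close</td></tr>
  <tr><td>&nbsp;</td></tr>
</table>
"""

rows = BeautifulSoup(html, "html.parser").find_all("tr")

records = []
for row in rows[2:]:
    cells = row.find_all("td")
    # Rows without all four columns (such as the trailing empty row)
    # would raise IndexError below, so skip them.
    if len(cells) < 4:
        continue
    link = cells[1].a
    records.append({
        "company": cells[0].get_text(strip=True),
        # Prefer the link text; fall back to the cell text if there is no <a>.
        "ticker": link.get_text(strip=True) if link else cells[1].get_text(strip=True),
        "eps_est": cells[2].get_text(strip=True),
        "time": cells[3].get_text(strip=True),
    })

print(records)
```

Building the dictionary inside the loop yields one record per data row, whereas a dictionary built after the loop would only see the cells of the final row.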