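I appreciate this has been asked many times, but I've been stuck on it for quite a while now.
I'm trying to get all the data from a table on a website and put it into a pandas DataFrame.
I've written the code to scrape the page, but for some reason I get an error when I try to write the cells into variables.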
import requests
url = 'http://www.londonstockexchange.com/exchange/prices/stocks/summary/fundamentals.html?fourWayKey=GB00BCDBXK43GBGBXASX1'
page = requests.get(url).text
from bs4 import BeautifulSoup
soup = BeautifulSoup(page)
# print(soup.prettify())
all_tables = soup.find_all('table')
right_table = soup.find_all('table', {'class':'table_dati'})
tbl1 = right_table[0]
A = []
B = []
C = []
D = []
E = []
F = []
for row in tbl1.find_all('tr'):
    cells = row.find_all('td')
    A.append(cells[0].find(text = True))
    B.append(cells[1].find(text = True))
    C.append(cells[2].find(text = True))
    D.append(cells[3].find(text = True))
    E.append(cells[4].find(text = True))
    F.append(cells[5].find(text = True))
This is the error:
A.append(cells[0].find(text = True))
IndexError: list index out of range
Any help is appreciated, thanks.
Answer 0 (score: 0)
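Well, if you look at the HTML, the first row your loop visits contains no td elements (it is the header row inside thead, made up of th cells). So cells is an empty list, and cells[0] fails because that first element doesn't exist.
This is the first row: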
<tr>
<th class="name">Income Statement</th>
<th>
31-May-13 <br>( £
m )
</th>
<th>
31-May-14 <br>( £
m )
</th>
<th>
31-May-15 <br>( £
m )
</th>
<th>
31-May-16 <br>( £
m )
</th>
<th>
31-May-17 <br>( £
m )
</th>
</tr>
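To make that concrete, here is a minimal sketch that parses a shortened stand-in for the table and counts the cells found in its first row. The `sample` string and the cell values in it are assumptions made up for illustration; nothing here is fetched from the live page.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the table (placeholder values only).
sample = """
<table class="table_dati">
  <thead>
    <tr><th class="name">Income Statement</th><th>31-May-13</th></tr>
  </thead>
  <tbody>
    <tr><td class="name">Revenue</td><td>1.00</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")
header_row = soup.find("tr")  # first <tr> in document order: the header row

td_cells = header_row.find_all("td")           # a header row holds no <td>
all_cells = header_row.find_all(["td", "th"])  # a list of names matches either tag

print(len(td_cells), len(all_cells))
```

With an empty `td_cells` list, any index such as `td_cells[0]` raises the same IndexError shown in the question.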
You can wrap the loop body in try/except, or select only the rows inside tbody.
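Staying close to your code, you can pass a list of tag names to find_all() and then skip any row whose list of cells has fewer than 6 entries. In the future, though, it would be better to build the lists dynamically rather than hard-code everything.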
for row in tbl1.find_all('tr'):
    try:
        cells = row.find_all(['td', 'th'])
        if len(cells) < 6:
            continue
        A.append(cells[0].find(text = True).strip())
        B.append(cells[1].find(text = True).strip())
        C.append(cells[2].find(text = True).strip())
        D.append(cells[3].find(text = True).strip())
        E.append(cells[4].find(text = True).strip())
        F.append(cells[5].find(text = True).strip())
    except Exception as e:
        print(e)
print(A)
The output is:
[
"Income Statement",
"Revenue",
"Operating Profit/(Loss)",
"Net Interest",
"Profit Before Tax",
"Profit After Tax",
"Profit After Tax",
"PROFIT FOR THE PERIOD",
"Minority Interests",
"Equity Holders of Parent Company",
"Earnings per Share - Basic",
"Earnings per Share - Diluted",
"Earnings per Share - Adjusted",
"Earnings per Share - Basic",
"Earnings per Share - Diluted",
"Earnings per Share - Adjusted",
"Dividend per Share"
]
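As for building the lists dynamically and ending up with the pandas DataFrame the question asks for, one possible sketch follows. The helper name `table_to_frame`, the `sample` HTML, and its numbers are all hypothetical placeholders rather than data from the real page. It also assumes the first row is the header and that every row has the same number of cells, which may not hold for the live table.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the table; the numbers are placeholders, not real figures.
sample = """
<table class="table_dati">
  <thead>
    <tr><th class="name">Income Statement</th><th>31-May-16 <br>( £ m )</th><th>31-May-17 <br>( £ m )</th></tr>
  </thead>
  <tbody>
    <tr><td class="name">Revenue</td><td>1.0</td><td>2.0</td></tr>
    <tr><td class="name">Net Interest</td><td>3.0</td><td>4.0</td></tr>
  </tbody>
</table>
"""

def table_to_frame(table):
    """Turn a bs4 <table> Tag into a DataFrame without fixing the column count."""
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["td", "th"])
        if not cells:
            continue  # skip rows that contain no cells at all
        # get_text(" ") joins the text pieces of a cell; split/join collapses whitespace
        rows.append([" ".join(c.get_text(" ").split()) for c in cells])
    header, *body = rows  # assumption: the first row holds the column names
    return pd.DataFrame(body, columns=header)

soup = BeautifulSoup(sample, "html.parser")
df = table_to_frame(soup.find("table", {"class": "table_dati"}))
print(df)
```

Each row becomes one list of however many cells it contains, so nothing needs to change if the table gains or loses a column. For the real page, you would pass `tbl1` from the question to the helper instead of the sample table.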