IndexError: list index out of range when web-scraping an HTML table - Python

Date: 2018-03-13 21:19:49

Tags: web-scraping index-error

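I appreciate this has been asked many times, but I have been stuck on it for a long time.

I am trying to get all the data from a table on a website and put it into a pandas data frame.

I have written the code to do the web scraping, but for some reason I get an error when I try to write the cell values into my lists.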

import requests
url = 'http://www.londonstockexchange.com/exchange/prices/stocks/summary/fundamentals.html?fourWayKey=GB00BCDBXK43GBGBXASX1'

page = requests.get(url).text

from bs4 import BeautifulSoup

soup = BeautifulSoup(page)

# print(soup.prettify())

all_tables = soup.find_all('table')

right_table = soup.find_all('table', {'class':'table_dati'})
tbl1 = right_table[0]

A = []
B = []
C = []
D = []
E = []
F = []

for row in tbl1.find_all('tr'):
  cells = row.find_all('td')
  A.append(cells[0].find(text = True))
  B.append(cells[1].find(text = True))
  C.append(cells[2].find(text = True))
  D.append(cells[3].find(text = True))
  E.append(cells[4].find(text = True))
  F.append(cells[5].find(text = True))

This is the error:

A.append(cells[0].find(text = True))

IndexError: list index out of range

Thanks for any help.

1 answer:

Answer 0 (score: 0):

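Well, if you look at the HTML, the row handled in your first iteration has no td elements (it is the header row, made of th cells), so when you try to get the first element it doesn't exist, because cells is empty.

This is the first row: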

<tr>
   <th class="name">Income Statement</th>
   <th>
      31-May-13 <br>( £
      m&nbsp;)
   </th>
   <th>
      31-May-14 <br>( £
      m&nbsp;)
   </th>
   <th>
      31-May-15 <br>( £
      m&nbsp;)
   </th>
   <th>
      31-May-16 <br>( £
      m&nbsp;)
   </th>
   <th>
      31-May-17 <br>( £
      m&nbsp;)
   </th>
</tr>

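You can wrap the access in a try/except, or select only the rows inside tbody.

Following your code, you can pass a list of tags to find_all() and then skip any row whose list of cells has fewer than 6 entries. In the future, though, it is better to build the lists dynamically rather than hard-coding everything.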
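A minimal sketch of the tbody option, assuming the table keeps its header row in thead and its data rows in tbody. The HTML string is a made-up miniature with placeholder values, not the real page, and the names SAMPLE_HTML, header_cells and labels are only for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table; cell values are placeholders.
SAMPLE_HTML = """
<table class="table_dati">
  <thead>
    <tr><th class="name">Income Statement</th><th>31-May-16</th><th>31-May-17</th></tr>
  </thead>
  <tbody>
    <tr><td>Revenue</td><td>a1</td><td>a2</td></tr>
    <tr><td>Net Interest</td><td>b1</td><td>b2</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"class": "table_dati"})

# The header row holds only <th> cells, so searching it for <td> finds nothing;
# indexing that empty result with [0] is what raises the IndexError.
header_row = table.find("tr")
header_cells = header_row.find_all("td")

# Restricting the search to <tbody> leaves the header row out of the loop.
body_rows = table.find("tbody").find_all("tr")
labels = [row.find_all("td")[0].get_text(strip=True) for row in body_rows]

print(len(header_cells))
print(labels)
```

If the real page turns out to have no tbody element, table.find("tbody") returns None, so checking for that before calling find_all() would be prudent.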

for row in tbl1.find_all('tr'):
    try:
        cells = row.find_all(['td', 'th'])
        if len(cells) < 6:
            continue
        A.append(cells[0].find(text = True).strip())
        B.append(cells[1].find(text = True).strip())
        C.append(cells[2].find(text = True).strip())
        D.append(cells[3].find(text = True).strip())
        E.append(cells[4].find(text = True).strip())
        F.append(cells[5].find(text = True).strip())
    except Exception as e:
        print(e)
print(A)

The output is:

[
  "Income Statement",
  "Revenue",
  "Operating Profit/(Loss)",
  "Net Interest",
  "Profit Before Tax",
  "Profit After Tax",
  "Profit After Tax",
  "PROFIT FOR THE PERIOD",
  "Minority Interests",
  "Equity Holders of Parent Company",
  "Earnings per Share - Basic",
  "Earnings per Share - Diluted",
  "Earnings per Share - Adjusted",
  "Earnings per Share - Basic",
  "Earnings per Share - Diluted",
  "Earnings per Share - Adjusted",
  "Dividend per Share"
]
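As a rough sketch of the "build the lists dynamically" suggestion, the rows can be collected as lists of whatever length each row has, instead of six fixed lists A to F. The helper name table_to_rows and the sample HTML with placeholder values are assumptions for illustration, not part of the original code:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table; cell values are placeholders.
SAMPLE_HTML = """
<table class="table_dati">
  <tr><th class="name">Income Statement</th><th>31-May-16</th><th>31-May-17</th></tr>
  <tr><td>Revenue</td><td>a1</td><td>a2</td></tr>
  <tr></tr>
  <tr><td>Net Interest</td><td>b1</td><td>b2</td></tr>
</table>
"""


def table_to_rows(table):
    """Return one list of stripped cell texts per non-empty <tr>."""
    rows = []
    for tr in table.find_all("tr"):
        # Accept both header and data cells, so no row is indexed blindly
        cells = tr.find_all(["td", "th"])
        if not cells:
            continue  # skip rows that contain no cells at all
        rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = table_to_rows(soup.find("table", {"class": "table_dati"}))

# Treat the first row as column names and the rest as data
header, data = rows[0], rows[1:]
print(header)
print(data)
```

With pandas installed, pd.DataFrame(data, columns=header) should then give the data frame the question asks for, provided every data row has as many cells as the header.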