Question

import urllib.request
import bs4 as bs

sauce = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies').read().decode()
soup = bs.BeautifulSoup(sauce, 'lxml')

soup.th.decompose()

table = soup.find('table')
trows = soup.find_all('tr')

for trow in trows:
    td = trow.find_all('td')
    row = [x.text for x in td]
    print(row)

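I've been scraping web pages and tables, and tables seem to be the hardest part. Still, I managed to build a list for each row of table data. The problem is that the header row (<th>) causes an empty list to be printed. That becomes an issue when I only want to print row[0] and row[1], because it raises "IndexError: list index out of range". I understand this happens because the <th> cells belong to a <tr> that has no <td>.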

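After going through the bs4 documentation, I tried removing the <th> header with .decompose(), but to no avail. An empty list is still produced. Any help with this would be greatly appreciated. Thanks.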

Answer 1

When a row comes back as [], you can simply skip it:

for trow in trows:
    td = trow.find_all('td')
    row = [x.text for x in td]

    # A header row has only <th> cells, so row is empty; skip it
    if not row:
        continue

    print(row)
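
For anyone who wants to try this without hitting Wikipedia, here is a minimal self-contained sketch of the same idea. The HTML snippet and its values are made up for illustration (not copied from the real page), and I use the built-in 'html.parser' so lxml is not needed. It also shows why removing <th> does not help: the enclosing <tr> is still there and still has no <td> cells, so the skip is what actually avoids the IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real table: one header row, two data rows.
html = """
<table>
  <tr><th>Symbol</th><th>Security</th><th>Sector</th></tr>
  <tr><td>MMM</td><td>3M</td><td>Industrials</td></tr>
  <tr><td>AOS</td><td>A. O. Smith</td><td>Industrials</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = []
for trow in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in trow.find_all("td")]
    if not cells:  # header row: only <th> cells, nothing to collect
        continue
    rows.append(cells)

# Safe to index now, because every remaining row has <td> cells.
first_two = [(row[0], row[1]) for row in rows]
print(first_two)
```

Searching from table rather than from soup also keeps rows from other tables on the page out of the loop.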

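I'll also point out that I hate trying to parse tables by searching through <table>, <tr>, <td>, and so on. It's sometimes necessary, but whenever I see a <table> tag I try Pandas first to see whether it gets me reasonably close to what I want. I'd rather do a bit of DataFrame manipulation than work through a lot of nested tags.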

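import urllib.request
from io import StringIO

import pandas as pd

sauce = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies').read().decode()

# Wrap the HTML string in StringIO; newer pandas versions deprecate passing literal HTML
tables = pd.read_html(StringIO(sauce))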

So .read_html() returns a list of DataFrames. In this case there are 2. To look at them, just run print(tables[0]) or print(tables[1]).
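
Since the goal was only row[0] and row[1], here is a rough sketch of picking the first two columns from the DataFrame instead. The HTML and column names below are invented for illustration, and it assumes a parser that pandas can use (such as lxml) is installed. With this sample, pandas treats the leading all-<th> row as the column header, so there is no empty row to skip.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page source; the real table has more columns.
html = """
<table>
  <tr><th>Symbol</th><th>Security</th><th>Sector</th></tr>
  <tr><td>MMM</td><td>3M</td><td>Industrials</td></tr>
  <tr><td>AOS</td><td>A. O. Smith</td><td>Industrials</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # one DataFrame per <table> found
df = tables[0]

# Equivalent of row[0] and row[1] for every row: the first two columns.
first_two = df.iloc[:, :2]
print(first_two)
```

Selecting by position with .iloc avoids depending on the exact header text, which may change on the live page.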

Scraping: how do I prevent an empty list from being created in a for loop?
