Question

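I'm trying to scrape a bunch of tables from a single web page. With the code below I can get one table, and the output displays correctly with pandas, but I can only ever get one table at a time.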

import bs4 as bs
import urllib.request
import pandas as pd

source = urllib.request.urlopen('https://www.URLHERE.com').read()
soup = bs.BeautifulSoup(source, 'lxml')
tables = soup.select('table')[-1]
rows = tables.find_all('tr')
output = []

for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    output.append([item for item in cols if item])
df = pd.DataFrame(output, columns = ['1','2', '3', '4', '5', '6'])
df = df.iloc[1:]

print(df)

If I remove the [-1] from the tables variable, I get the following error.

AttributeError: 'list' object has no attribute 'find_all'

What do I need to change to get all of the tables off the page?

Answer 1

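You're already on the right track. As a commenter already said, you need to find_all the tables, and then apply the row logic you already have to each table in a loop, rather than to just a single table. Your code would look something like this: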

tables = soup.find_all('table')
for table in tables:
    # individual table logic here
    rows = table.find_all('tr')
    for row in rows:
        # individual row logic here, e.g. collect the text of each cell
        cells = [td.text.strip() for td in row.find_all('td')]
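
To make the outline above concrete, here is a minimal sketch that builds one DataFrame per table. The inline HTML, the `frames` list, and the use of the built-in `html.parser` are my own assumptions for illustration, standing in for the downloaded page and the `lxml` parser from the question.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical inline HTML standing in for the downloaded page
html = """
<table>
  <tr><th>name</th><th>qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
<table>
  <tr><th>city</th></tr>
  <tr><td>Oslo</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

frames = []  # one DataFrame per <table> on the page
for table in soup.find_all("table"):
    table_rows = []
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:  # header rows hold only <th> cells, so they produce no <td> text
            table_rows.append(cells)
    frames.append(pd.DataFrame(table_rows))

for frame in frames:
    print(frame)
```

Keeping a separate DataFrame per table avoids forcing tables with different column counts into one fixed set of column names.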

Answer 2

I took a closer look at this, and here is some sample code that I tested:

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('URL').read()
soup = bs.BeautifulSoup(source, 'lxml')
tables = soup.select('table')
print("I found " + str(len(tables)) + " tables.")

all_rows = []
for table in tables:
    print("Searching for <tr> items...")
    rows = table.find_all('tr')
    print("Found " + str(len(rows)) + " rows.")
    for row in rows:
        all_rows.append(row)


print("In total I have got " + str(len(all_rows)) + " rows.")

# example of first row
print(all_rows[0])

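A short explanation: the AttributeError you get when you remove the [-1] happens because the tables variable is then a list object, and a list has no find_all method.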

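Your trick with [-1] is fine. I assume you know that [-1] takes the last item from the list. So you have to do the same thing for every element, as shown in the code above.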
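
A small sketch of the difference between the list and its elements, assuming a made-up two-table HTML snippet and the built-in `html.parser` (both are my own choices for illustration, not part of the original code):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with two tables
html = """
<table><tr><td>a</td></tr></table>
<table><tr><td>b</td></tr><tr><td>c</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

tables = soup.select("table")   # a list-like collection of tag objects
last_table = tables[-1]         # a single tag, which does have find_all

is_list = isinstance(tables, list)
tag_has_find_all = hasattr(last_table, "find_all")
print(is_list, tag_has_find_all)

# Call find_all on each element, not on the collection itself
row_counts = [len(table.find_all("tr")) for table in tables]
print(row_counts)
```

Indexing with [-1] hands you one element, while looping hands you each element in turn; in both cases find_all is called on a tag, never on the list.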

You may find it interesting to read about the for construct in Python and about iterables: https://pythonfordatascience.org/for-loops-and-iterations-python/

Answer 3

Well, if you want to extract all the different tables on a web page in one go, you should try:

tables = pd.read_html("<URL_HERE>")

tables will be a list of DataFrames, one for each table on that page.
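
As a rough illustration that needs no network access, the sketch below feeds `pd.read_html` a made-up HTML string wrapped in `StringIO` instead of a URL. The HTML content is hypothetical, and I am assuming an HTML parser that pandas can use (such as `lxml`, which the question already relies on) is installed.

```python
from io import StringIO
import pandas as pd

# Hypothetical page content with two tables
html = """
<table>
  <thead><tr><th>name</th><th>qty</th></tr></thead>
  <tbody>
    <tr><td>apple</td><td>3</td></tr>
    <tr><td>pear</td><td>5</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>city</th></tr></thead>
  <tbody><tr><td>Oslo</td></tr></tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))

print(len(tables))
for df in tables:
    print(df)
```

Each DataFrame can then be picked out by index, for example `tables[0]` for the first table on the page.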

For more specific documentation, see the pandas documentation.
