Question

I am trying to use BeautifulSoup to print a table of baby names as a list.

google-python-exercises/google-python-exercises/babynames/baby1990.html (the HTML page is a saved snapshot of the actual URL)

After fetching the page with urllib.request and parsing it with BeautifulSoup, I can print the data in each row of the table, but the output is wrong.

Here is my code:

right_table = soup.find('table',attrs = {"summary" : "Popularity for top 1000"})
table_rows = right_table.find_all('tr') 

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

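It should print one list containing the data from all the rows, but instead I get many lists, each one starting one record later than the previous one.

Like this: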

['997', 'Eliezer', 'Asha', '998', 'Jory', 'Jada', '999', 'Misael', 'Leila', '1000', 'Tate', 'Peggy', 'Note: Rank 1 is the most popular,\nrank 2 is the next most popular, and so forth. \n']
['998', 'Jory', 'Jada', '999', 'Misael', 'Leila', '1000', 'Tate', 'Peggy', 'Note: Rank 1 is the most popular,\nrank 2 is the next most popular, and so forth. \n']
['999', 'Misael', 'Leila', '1000', 'Tate', 'Peggy', 'Note: Rank 1 is the most popular,\nrank 2 is the next most popular, and so forth. \n']
['1000', 'Tate', 'Peggy', 'Note: Rank 1 is the most popular,\nrank 2 is the next most popular, and so forth. \n']
['Note: Rank 1 is the most popular,\nrank 2 is the next most popular, and so forth. \n']

How can I print just one list?

Answer 1

I would try pandas, and index into the list of tables it returns to get the one you need.

import pandas as pd

tables = pd.read_html('yourURL')

print(tables[1]) # for example; change index as required

Answer 2

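Your loop creates the row list and prints it, then moves on to the next iteration, where it creates a new row list (overwriting the previous one) and prints that, and so on.

I'm not sure why you want all the rows merged into one list, but to end up with a single final list you need to add each row list to that final list on every iteration.

Or did you actually mean that you want a list of row lists? For a single flat list: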

right_table = soup.find('table',attrs = {"summary" : "Popularity for top 1000"})
table_rows = right_table.find_all('tr') 


result_list = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    result_list = result_list + row


print(result_list)

If you do want a list of row lists, use this instead:

right_table = soup.find('table',attrs = {"summary" : "Popularity for top 1000"})
table_rows = right_table.find_all('tr') 


result_list = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    result_list.append(row)


print(result_list)
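
The only difference between the two snippets above is how each row is added to result_list. A minimal sketch of that difference, using made-up rows in place of the parsed table (the names `rows`, `flat` and `nested` are just for illustration):

```python
# Hypothetical sample rows standing in for the <td> text of each <tr>.
rows = [["1", "Michael", "Jessica"], ["2", "Christopher", "Ashley"]]

flat = []    # one list holding every cell
nested = []  # one inner list per row

for row in rows:
    flat.extend(row)    # adds the cells themselves, like result_list + row
    nested.append(row)  # adds the row as a single element

print(flat)
print(nested)
```

Here extend is used instead of `result_list = result_list + row`; both give the same flat list, but extend avoids building a new list on every iteration.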

But honestly, I would use pandas and .read_html() as QHarr suggested.

And if you only want to print each cell on its own line:

right_table = soup.find('table',attrs = {"summary" : "Popularity for top 1000"})
table_rows = right_table.find_all('tr') 


for tr in table_rows:
    td = tr.find_all('td')
    for data in td:
        # data is a single <td>; td is the whole list of cells and has no .text
        print(data.text)
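
One more thing that may be worth checking: the output in the question, where every list starts one row later than the one before, would be consistent with the page leaving its <tr> tags unclosed. In that case html.parser may nest each row inside the previous one, so tr.find_all('td') also picks up the cells of all later rows. That is an assumption about the page, not something the question confirms. A rough sketch on a made-up two-row table, restricting the search to the direct children of each row:

```python
from bs4 import BeautifulSoup

# Made-up sample; the <tr> tags are deliberately left unclosed.
html = """
<table summary="Popularity for top 1000">
<tr><td>1</td><td>Michael</td><td>Jessica</td>
<tr><td>2</td><td>Christopher</td><td>Ashley</td>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
right_table = soup.find("table", attrs={"summary": "Popularity for top 1000"})

rows = []
for tr in right_table.find_all("tr"):
    # recursive=False looks only at the direct <td> children of this <tr>,
    # so cells belonging to a <tr> nested inside it are not counted again.
    cells = [td.text for td in tr.find_all("td", recursive=False)]
    if cells:  # skip rows that have no <td> cells, such as header rows
        rows.append(cells)

print(rows)
```

Switching to a more lenient parser such as lxml or html5lib, if installed, is another option, since those repair this kind of markup while parsing.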

Scraping a table using BeautifulSoup in Jupyter Notebook
