Scraping a table with BeautifulSoup in Jupyter Notebook

Time: 2019-03-02 19:14:11

Tags: python web-scraping beautifulsoup

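I am trying to use BeautifulSoup to print a table of baby names that is given as a list.

google-python-exercises/google-python-exercises/babynames/baby1990.html (the HTML page is a snapshot of the actual URL)

After fetching the page with urllib.request and parsing it with BeautifulSoup, I can print the data in each row of the table, but the output is wrong.

Here is my code: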

right_table = soup.find('table', attrs={"summary": "Popularity for top 1000"})
table_rows = right_table.find_all('tr')

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

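It should print a single list containing all the data in the rows. Instead I get many lists, each one starting one record later than the list before it, like this: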

['997', 'Eliezer', 'Asha', '998', 'Jory', 'Jada', '999', 'Misael', 'Leila', '1000', 'Tate', 'Peggy', 'Note: Rank 1 is the most popular,\nrank 2 is the next most popular, and so forth. \n']
['998', 'Jory', 'Jada', '999', 'Misael', 'Leila', '1000', 'Tate', 'Peggy', 'Note: Rank 1 is the most popular,\nrank 2 is the next most popular, and so forth. \n']
['999', 'Misael', 'Leila', '1000', 'Tate', 'Peggy', 'Note: Rank 1 is the most popular,\nrank 2 is the next most popular, and so forth. \n']
['1000', 'Tate', 'Peggy', 'Note: Rank 1 is the most popular,\nrank 2 is the next most popular, and so forth. \n']
['Note: Rank 1 is the most popular,\nrank 2 is the next most popular, and so forth. \n']

How can I print just one list?

2 answers:

Answer 0: (score: 1)

I would try pandas, then index into the list of tables it returns to get the one you want.

import pandas as pd

tables = pd.read_html('yourURL')

print(tables[1]) # for example; change index as required
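
As a rough sketch of how this might look on a small inline table: the HTML string and its header labels below are made up for illustration, and only the `summary` attribute and the name rows come from the question. It assumes pandas has an HTML parser such as lxml available. The `attrs` argument of `pd.read_html` can select the table by attribute instead of guessing an index.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the real page; the header labels are assumptions.
html = """
<table summary="Popularity for top 1000">
  <tr><th>Rank</th><th>Male name</th><th>Female name</th></tr>
  <tr><td>997</td><td>Eliezer</td><td>Asha</td></tr>
  <tr><td>998</td><td>Jory</td><td>Jada</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per matching <table>.
tables = pd.read_html(StringIO(html), attrs={"summary": "Popularity for top 1000"})
names = tables[0]
print(names)
```

With a real page, the URL or the downloaded HTML would take the place of the inline string.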

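Answer 1: (score: 0)

Your loop creates a row list, prints it, then moves on to the next iteration, where it creates a new row list (overwriting the previous one), prints it, and so on.

I'm not sure why you want all the rows merged into a single list, but to end up with one final list you need to add each row list to the final list on every iteration.

Or do you actually mean that you want a list of row lists?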

right_table = soup.find('table',attrs = {"summary" : "Popularity for top 1000"})
table_rows = right_table.find_all('tr') 


result_list = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    result_list = result_list + row


print(result_list)

If you do want a list of row lists, use the following:

right_table = soup.find('table',attrs = {"summary" : "Popularity for top 1000"})
table_rows = right_table.find_all('tr') 


result_list = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    result_list.append(row)


print(result_list)
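
The difference between the two loops above comes down to concatenating versus appending. A minimal illustration with plain lists, using two rows taken from the question's output (no parsing involved, and the variable names are only for this sketch):

```python
# Rows as they might come out of the parsing loop.
rows = [["999", "Misael", "Leila"], ["1000", "Tate", "Peggy"]]

flat = []
nested = []
for row in rows:
    flat.extend(row)    # same result as flat = flat + row, but mutates in place
    nested.append(row)  # keeps each row as its own list

print(flat)    # one flat list of cells
print(nested)  # a list of row lists
```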

Honestly, though, I would use pandas and .read_html() as QHarr suggested.

right_table = soup.find('table', attrs={"summary": "Popularity for top 1000"})
table_rows = right_table.find_all('tr')


# Print the text of every cell on its own line
for tr in table_rows:
    td = tr.find_all('td')
    for data in td:
        print(data.text)  # use the cell (data), not the result list (td)
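
One possible explanation for the "each list starts one record later" pattern, offered as an assumption rather than something confirmed in this thread: if the page leaves its `<tr>` tags unclosed, `html.parser` nests every row inside the previous one, so `tr.find_all('td')` also collects the cells of all later rows. The sketch below reproduces that on a made-up two-row snippet and shows that `recursive=False` limits the search to a row's direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with unclosed <tr> tags; rows taken from the question's output.
html = """
<table summary="Popularity for top 1000">
<tr><td>997</td><td>Eliezer</td><td>Asha</td>
<tr><td>998</td><td>Jory</td><td>Jada</td>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"summary": "Popularity for top 1000"})

# Default search descends into nested rows, so earlier rows pick up later cells.
nested_rows = [[td.text for td in tr.find_all("td")] for tr in table.find_all("tr")]

# recursive=False only looks at each row's own direct <td> children.
direct_rows = [
    [td.text for td in tr.find_all("td", recursive=False)]
    for tr in table.find_all("tr")
]

print(nested_rows)
print(direct_rows)
```

A more lenient parser such as lxml or html5lib may close the rows implicitly, which would be another way around the problem if one of them is installed.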