I am trying to scrape table data from the following site using BeautifulSoup and requests: https://www.worldometers.info/world-population/
When I run the code, I get this error:
> Traceback (most recent call last):
>   File "d:\python\population\worldpop.py", line 16, in <dictcomp>
>     result=[{ header[index]:cells.text for index,cells in enumerate(row.find_all('td'))} for row in rows_data]
> IndexError: list index out of range
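I know that this type of error occurs when accessing an item that is out of range, but I can't work out where that happens in this particular case. I would appreciate a proper solution.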
# Working on scraping table data from worldometers.info and converting it to a CSV file.
from bs4 import BeautifulSoup
import requests
import pandas

url = 'https://www.worldometers.info/world-population/'

def world_population():
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    pop_data = soup.find('table', class_='table table-striped table-bordered table-hover table-condensed table-list')
    header = [heading.text for heading in pop_data.find_all('th')]
    # print(header)
    rows_data = [row for row in pop_data.find_all('tr')]
    # This is the line that raises the IndexError shown above
    result = [{header[index]: cells.text for index, cells in enumerate(row.find_all('td'))} for row in rows_data]
    df = pandas.DataFrame(result)
    df.to_csv('pop.csv')

world_population()
Answer 0 (score: 0)
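You can use pandas' .read_html() to parse the <table> tags. It returns the tables it finds as a list of DataFrames, one per table. Then just pull out the table you want by its index in that list.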
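To make the "list of DataFrames" behaviour concrete, here is a minimal sketch that runs without network access. The HTML string is made up for illustration (only the two population figures are taken from the output in this answer), and the second table and the `match` lookup are assumptions added to show an alternative to picking by position.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables, standing in for a downloaded page.
HTML = """
<table>
  <tr><th>Year</th><th>Population</th></tr>
  <tr><td>2019</td><td>7713468100</td></tr>
  <tr><td>2018</td><td>7631091040</td></tr>
</table>
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(HTML))

# Pick a table by its position in the list ...
population = tables[0]

# ... or keep only the tables whose text matches a string or regex.
countries = pd.read_html(StringIO(HTML), match="Capital")[0]
```

Selecting by position is the simplest option, but it depends on the order of tables on the page; `match` may be more robust if the page layout changes.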
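from io import StringIO

import requests
import pandas as pd

url = 'https://www.worldometers.info/world-population/'

def world_population():
    page = requests.get(url)
    # read_html returns one DataFrame per <table>; take the first one.
    # Wrapping the text in StringIO avoids the deprecation warning that
    # recent pandas versions emit for literal HTML strings.
    df = pd.read_html(StringIO(page.text))[0]
    df.to_csv('pop.csv')
    return df

df = world_population()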
Output:
print(df.to_string())
    Year (July 1)     Population  Yearly % Change  Yearly Change  Median Age  Fertility Rate  Density (P/Km²)  Urban Pop %  Urban Population
             2020  7,794,798,739           1.05 %     81,330,639        30.9            2.47               52       56.2 %     4,378,993,944
             2019  7,713,468,100           1.08 %     82,377,060        29.8            2.51               52       55.7 %     4,299,438,618
 0           2018     7631091040           1.10 %       83232115        29.8            2.51               51       55.3 %        4219817318
 1           2017     7547858925           1.12 %       83836876        29.8            2.51               51       54.9 %        4140188594
 2           2016     7464022049           1.14 %       84224910        29.8            2.51               50       54.4 %        4060652683
 3           2015     7379797139           1.19 %       84594707        30.0            2.52               50       54.0 %        3981497663
 4           2010     6956823603           1.24 %       82983315        28.0            2.58               47       51.7 %        3594868146
 5           2005     6541907027           1.26 %       79682641        27.0            2.65               44       49.2 %        3215905863
 6           2000     6143493823           1.35 %       79856169        26.0            2.78               41       46.7 %        2868307513
 7           1995     5744212979           1.52 %       83396384        25.0            3.01               39       44.8 %        2575505235
 8           1990     5327231061           1.81 %       91261864        24.0            3.44               36       43.0 %        2290228096
 9           1985     4870921740           1.79 %       82583645        23.0            3.59               33       41.2 %        2007939063
10           1980     4458003514           1.79 %       75704582        23.0            3.86               30       39.3 %        1754201029
11           1975     4079480606           1.97 %       75808712        22.0            4.47               27       37.7 %        1538624994
12           1970     3700437046           2.07 %       72170690        22.0            4.93               25       36.6 %        1354215496
13           1965     3339583597           1.93 %       60926770        22.0            5.02               22         N.A.              N.A.
14           1960     3034949748           1.82 %       52385962        23.0            4.90               20       33.7 %        1023845517
15           1955     2773019936           1.80 %       47317757        23.0            4.97               19         N.A.              N.A.
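As for the original error: the only indexing in the failing line is header[index], so the IndexError means that at least one row produced more <td> cells than there are <th> headings in the list. Which row does that on the real page is not established here. If you would rather stay with BeautifulSoup, one possible guard is to pair headings and cells with zip instead of indexing. The sketch below uses made-up HTML (the extra third cell is deliberate), not the actual page markup.

```python
from bs4 import BeautifulSoup

# Made-up table: two <th> headings, but the last row has three <td> cells.
HTML = """
<table>
  <tr><th>Year</th><th>Population</th></tr>
  <tr><td>2019</td><td>7713468100</td></tr>
  <tr><td>2018</td><td>7631091040</td><td>extra</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table")
header = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for row in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if not cells:
        # Header rows contain only <th> cells, so there is nothing to record.
        continue
    # zip stops at the shorter sequence, so an extra cell can no longer
    # index past the end of header; it is silently dropped instead.
    records.append(dict(zip(header, cells)))
```

Note the trade-off: zip hides the mismatch rather than explaining it, so it is worth printing len(header) and the cell count per row once to see where the page's structure differs from what the code expects.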