Scraping table data in Python with BeautifulSoup and Requests

Time: 2020-09-02 10:48:10

Tags: python-3.x pandas web-scraping beautifulsoup python-requests

I am trying to scrape table data from the following site using BeautifulSoup and Requests: https://www.worldometers.info/world-population/

When I run the code, I get this error:

> Traceback (most recent call last):
>   File "d:\python\population\worldpop.py", line 16, in <dictcomp>
>     result=[{ header[index]:cells.text for index,cells in enumerate(row.find_all('td'))} for row in rows_data]
> IndexError: list index out of range

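I understand that this kind of error occurs when accessing an index that is out of range, but I am stuck on this particular case. I would appreciate a proper solution.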

# Working on scraping table data from worldometers.info and converting it to a CSV file.

from bs4 import BeautifulSoup
import requests
import pandas

url='https://www.worldometers.info/world-population/'

def world_population():
    page=requests.get(url)
    soup=BeautifulSoup(page.content,'html.parser')
    pop_data=soup.find('table', class_='table table-striped table-bordered table-hover table-condensed table-list')
    header=[heading.text for heading in pop_data.find_all('th')]
    #print(header)
    rows_data=[row for row in pop_data.find_all('tr')]

    result=[{ header[index]:cells.text for index,cells in enumerate(row.find_all('td'))} for row in rows_data]
    
    df=pandas.DataFrame(result)
    df.to_csv('pop.csv')

world_population()


    

1 Answer:

Answer 0 (score: 0)

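You can use pandas' .read_html() to parse <table> tags. It returns the tables it finds as a list of DataFrames. Then just pick the table you want by its index in that list.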

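from io import StringIO

import requests
import pandas as pd

url='https://www.worldometers.info/world-population/'

def world_population():
    page=requests.get(url)
    # read_html returns one DataFrame per <table>; [0] selects the first.
    # Wrapping the text in StringIO avoids the deprecation warning that
    # newer pandas versions raise for literal HTML strings.
    df = pd.read_html(StringIO(page.text))[0]
    df.to_csv('pop.csv')
    return df

df = world_population()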
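
If you would rather keep the BeautifulSoup approach from the question, the traceback tells you that at least one row has more <td> cells than there are <th> headings, so header[index] runs past the end of the list. Below is a minimal sketch of one way around that: pair headings and cells with zip(), which stops at the shorter of the two. The inline HTML and the helper name table_to_records are made up for illustration, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the first data row
# deliberately has one more <td> than there are <th> headings.
html = """
<table>
  <thead><tr><th>Year</th><th>Population</th></tr></thead>
  <tbody>
    <tr><td>2020</td><td>7,794,798,739</td><td>extra</td></tr>
    <tr><td>2019</td><td>7,713,468,100</td></tr>
  </tbody>
</table>
"""

def table_to_records(table):
    """Turn a <table> tag into a list of {heading: cell text} dicts."""
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    records = []
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if not cells:
            # Rows that contain only <th> cells carry no data.
            continue
        # zip() stops at the shorter sequence, so surplus cells are
        # dropped instead of raising IndexError.
        records.append(dict(zip(header, cells)))
    return records

soup = BeautifulSoup(html, "html.parser")
records = table_to_records(soup.find("table"))
print(records)
```

Note that this silently discards cells with no matching heading. If those cells matter, it is worth printing len(header) and the cell count per row first to see where the mismatch comes from.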

Output of the read_html() version:

print(df.to_string())
   Year (July 1)    Population Yearly % Change Yearly Change Median Age Fertility Rate Density (P/Km²) Urban Pop % Urban Population
            2020 7,794,798,739          1.05 %    81,330,639       30.9           2.47              52      56.2 %    4,378,993,944
            2019 7,713,468,100          1.08 %    82,377,060       29.8           2.51              52      55.7 %    4,299,438,618
0           2018    7631091040          1.10 %      83232115       29.8           2.51              51      55.3 %       4219817318
1           2017    7547858925          1.12 %      83836876       29.8           2.51              51      54.9 %       4140188594
2           2016    7464022049          1.14 %      84224910       29.8           2.51              50      54.4 %       4060652683
3           2015    7379797139          1.19 %      84594707       30.0           2.52              50      54.0 %       3981497663
4           2010    6956823603          1.24 %      82983315       28.0           2.58              47      51.7 %       3594868146
5           2005    6541907027          1.26 %      79682641       27.0           2.65              44      49.2 %       3215905863
6           2000    6143493823          1.35 %      79856169       26.0           2.78              41      46.7 %       2868307513
7           1995    5744212979          1.52 %      83396384       25.0           3.01              39      44.8 %       2575505235
8           1990    5327231061          1.81 %      91261864       24.0           3.44              36      43.0 %       2290228096
9           1985    4870921740          1.79 %      82583645       23.0           3.59              33      41.2 %       2007939063
10          1980    4458003514          1.79 %      75704582       23.0           3.86              30      39.3 %       1754201029
11          1975    4079480606          1.97 %      75808712       22.0           4.47              27      37.7 %       1538624994
12          1970    3700437046          2.07 %      72170690       22.0           4.93              25      36.6 %       1354215496
13          1965    3339583597          1.93 %      60926770       22.0           5.02              22        N.A.             N.A.
14          1960    3034949748          1.82 %      52385962       23.0           4.90              20      33.7 %       1023845517
15          1955    2773019936          1.80 %      47317757       23.0           4.97              19        N.A.             N.A.
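
Several columns in this output are text rather than numbers: values such as "7,794,798,739", "1.05 %" and "N.A." will not sort or sum correctly. Here is a rough sketch of one way to convert them, using a small hand-typed sample instead of the live page. The helper name to_number and the choice of columns are my own assumptions, not something the site or pandas prescribes.

```python
import pandas as pd

# Hypothetical sample that mirrors a few values from the output above.
raw = pd.DataFrame({
    "Year (July 1)": [2020, 1965],
    "Population": ["7,794,798,739", "3339583597"],
    "Yearly % Change": ["1.05 %", "1.93 %"],
    "Urban Pop %": ["56.2 %", "N.A."],
})

def to_number(series):
    """Strip thousands separators and percent signs, then convert to numbers."""
    cleaned = (
        series.astype(str)
        .str.replace(",", "", regex=False)
        .str.replace("%", "", regex=False)
        .str.strip()
    )
    # errors="coerce" turns entries that cannot be parsed, such as "N.A.", into NaN.
    return pd.to_numeric(cleaned, errors="coerce")

clean = raw.copy()
for column in ["Population", "Yearly % Change", "Urban Pop %"]:
    clean[column] = to_number(raw[column])

print(clean.dtypes)
```

The same loop can be applied to the DataFrame from read_html() before calling to_csv(), provided the column names match those on the page.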