Web scraping into a pandas DataFrame

Time: 2020-05-07 12:01:11

Tags: python web-scraping beautifulsoup

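I am trying to extract data from https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population. I want to put the data for the top 20 cities into a pandas DataFrame with the following columns: Rank | City | Latitude | Longitude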

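That way I can pull out the coordinates later in my code and compute the various parameters I need. This is what I have come up with so far, but it seems to fail: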

rank=[]
city=[]
state=[]
population_present=[]
population_past=[]
changepercent=[]


info = requests.get('https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population').text
bs = BeautifulSoup(info, 'html.parser')

for row in bs.find('table').find_all('tr'):
    p = row.find_all('td')


for row in bs.find('table').find_all('tr'):
    p= row.find_all('td')
    if(len(p) > 0):
        rank.append(p[0].text)
        city.append(p[1].text)
        latitude.append(p[2].text.rstrip('\n'))

2 Answers:

Answer 0: (score: 1)

You can do this with pandas. Try the following code.

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

info = requests.get('https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population').text
bs = BeautifulSoup(info, 'html.parser')
# The city table was the second 'wikitable' on the page when this was written
table = bs.find_all('table', class_='wikitable')[1]
# Wrap the HTML in StringIO; newer pandas versions deprecate passing a raw HTML string
df = pd.read_html(StringIO(str(table)))[0]
# Get the first 20 records
df1 = df.iloc[:20]

Rank = df1['2018rank'].values.tolist()
City = df1['City'].values.tolist()
# Get the location strings as a list
locationlist = df1['Location'].values.tolist()
Latitude = []
Longitude = []
for val in locationlist:
    # The decimal coordinates follow the last "/" in each cell
    val1 = val.split("/")[-1]
    Latitude.append(val1.split()[0])
    Longitude.append(val1.split()[-1])

df2 = pd.DataFrame({"Rank": Rank, "City": City, "Latitude": Latitude, "Longitude": Longitude})
print(df2)

Output

                City    Latitude   Longitude  Rank
0        New York[d]  40.6635°N   73.9387°W     1
1        Los Angeles  34.0194°N  118.4108°W     2
2            Chicago  41.8376°N   87.6818°W     3
3         Houston[3]  29.7866°N   95.3909°W     4
4            Phoenix  33.5722°N  112.0901°W     5
5    Philadelphia[e]  40.0094°N   75.1333°W     6
6        San Antonio  29.4724°N   98.5251°W     7
7          San Diego  32.8153°N  117.1350°W     8
8             Dallas  32.7933°N   96.7665°W     9
9           San Jose  37.2967°N  121.8189°W    10
10            Austin  30.3039°N   97.7544°W    11
11   Jacksonville[f]  30.3369°N   81.6616°W    12
12        Fort Worth  32.7815°N   97.3467°W    13
13          Columbus  39.9852°N   82.9848°W    14
14  San Francisco[g]  37.7272°N  123.0322°W    15
15         Charlotte  35.2078°N   80.8310°W    16
16   Indianapolis[h]  39.7767°N   86.1459°W    17
17           Seattle  47.6205°N  122.3509°W    18
18         Denver[i]  39.7619°N  104.8811°W    19
19     Washington[j]  38.9041°N   77.0172°W    20
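
Since the question mentions computing things from the coordinates later on, note that the Latitude and Longitude values above are still strings such as 40.6635°N. A minimal sketch of turning them into signed floats could look like the following. The helper name to_signed_degrees is made up for illustration, and the cleanup of byte-order marks and non-breaking spaces is an assumption about what the scraped cells may contain, not something confirmed by the output above.

```python
import re

# Matches strings such as "40.6635°N" or "73.9387°W"
_COORD = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*°?\s*([NSEW])\s*$")


def to_signed_degrees(text):
    """Convert '40.6635°N' style text to a float; S and W become negative.

    Hypothetical helper, not part of pandas or BeautifulSoup.
    """
    # Assumption: scraped cells may carry stray U+FEFF or non-breaking spaces
    cleaned = text.replace("\ufeff", "").replace("\xa0", " ")
    match = _COORD.match(cleaned)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    value = float(match.group(1))
    hemisphere = match.group(2)
    return -value if hemisphere in "SW" else value


print(to_signed_degrees("40.6635°N"))
print(to_signed_degrees("73.9387°W"))
```

With df2 from the code above, something like df2["Latitude"].map(to_signed_degrees) should give a numeric column, assuming every cell follows this pattern.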

Answer 1: (score: 0)

You are accessing the wrong element on the page. To reach the table that holds the data you want, use the following approach:

info = requests.get('https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population').text
bs = BeautifulSoup(info, 'html.parser')

for tr in bs.findAll('table')[4].findAll('tr'):
    # Now take the data from this row that you want, and put it in a DataFrame
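
The loop body is left open above, so here is one possible way to fill it in, as a rough sketch only. It runs against a small inline HTML snippet instead of the live page, because the table index and column layout on Wikipedia can change. SAMPLE_HTML, table_to_records and the column names are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so the sketch runs offline
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>City</th><th>Location</th></tr>
  <tr><td>1</td><td>New York</td><td>40.6635°N 73.9387°W</td></tr>
  <tr><td>2</td><td>Los Angeles</td><td>34.0194°N 118.4108°W</td></tr>
</table>
"""


def table_to_records(table, columns):
    """Return one dict per data row, keyed by the given column names."""
    records = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) < len(columns):
            continue  # skip header rows, which contain <th> rather than <td>
        records.append(dict(zip(columns, cells)))
    return records


table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")
records = table_to_records(table, ["Rank", "City", "Location"])
print(records)
```

The resulting list of dicts can be passed straight to pd.DataFrame(records). On the real page you would pass bs.findAll('table')[4] (or whichever index currently holds the city table) and adjust the column list to match.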