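I'm trying to match area numbers to Chicago's community areas using this Wikipedia page: https://en.wikipedia.org/wiki/Community_areas_in_Chicago
I know how to do this one table at a time, but I believe a loop would make the task easier.
However, the tables don't contain the side names, so I may have to match them more manually, with a join or a dictionary.
The code below works, but it scrapes all the tables into a single table, so I can't tell the "sides" apart.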
import pandas as pd

url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'
df_list = []
for i in range(0, 9):
    # Re-reads the page on every pass and keeps only the i-th table
    df_list.append(pd.read_html(url, header=0)[i])
df = pd.concat(df_list).drop_duplicates()
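Main task: I'd like to scrape all the tables and add an extra index column that is unique to each table (the side name would be perfect). Can this be done with pandas?
A minor question: there are 9 sides, but when I use the range (0:8), the last table is missing and I don't know why. Is there a way to set this range automatically, using something like len?
Answer 0 (score: 0)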
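The thing with read_html() is that it's great when you need to parse <table> tags, but anything outside the <table> tags isn't captured. The side names sit in headings outside the tables, so you'll need BeautifulSoup to be more specific about how to grab the data.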
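To make that concrete, here is a minimal sketch of reaching a heading that sits outside a table. The HTML string is a hypothetical, simplified stand-in for the page's markup (an assumption, not copied from Wikipedia), so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: a heading followed by a captioned table
html = """
<h3>North Side[edit]</h3>
<table>
  <caption>Far North side</caption>
  <tr><th>Number</th><th>Community area</th></tr>
  <tr><td>01</td><td>Rogers Park</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# The closest <h3> that precedes the table in document order
heading = table.find_previous('h3').text.split('[')[0].strip()
# The <caption> lives inside the table, so a plain find() reaches it
caption = table.find('caption').text.strip()

print(heading, '|', caption)
```

The split('[') step only trims a trailing "[edit]" marker if the heading text happens to include one.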
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

tables = soup.find_all('table')
columns = ['Community area by side', 'Sub community area by side', 'Number', 'Community area', 'Neighborhoods']
records = []
for table in tables:
    # The side name is the nearest <h3> heading before the table
    main_area = table.find_previous('h3').text.split('[')[0].strip()
    # Some tables carry a <caption> naming a sub-area
    caption = table.find('caption')
    sub_area = caption.text.strip() if caption else 'N/A'
    for row in table.find_all('tr'):
        data = row.find_all('td')
        # Skip header rows and rows without the three expected cells
        if len(data) < 3:
            continue
        number = data[0].text.strip()
        com_area = data[1].text.strip()
        # One output row per neighborhood; keep one empty entry if none are listed
        n_list = [each.text.strip() for each in data[2].find_all('li')] or ['']
        for each in n_list:
            records.append([main_area, sub_area, number, com_area, each])

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
results_df = pd.DataFrame(records, columns=columns)
Output:
print(results_df.to_string())
    Community area by side  Sub community area by side  Number  Community area   Neighborhoods
0   Central                 N/A                         08      Near North Side  Cabrini–Green
1   Central                 N/A                         08      Near North Side  The Gold Coast
2   Central                 N/A                         08      Near North Side  Goose Island
3   Central                 N/A                         08      Near North Side  Magnificent Mile
4   Central                 N/A                         08      Near North Side  Old Town
5   Central                 N/A                         08      Near North Side  River North
6   Central                 N/A                         08      Near North Side  River West
7   Central                 N/A                         08      Near North Side  Streeterville
8   Central                 N/A                         32      Loop             Loop
9   Central                 N/A                         32      Loop             New Eastside
10  Central                 N/A                         32      Loop             South Loop
11  Central                 N/A                         32      Loop             West Loop Gate
12  Central                 N/A                         33      Near South Side  Dearborn Park
13  Central                 N/A                         33      Near South Side  Printer's Row
14  Central                 N/A                         33      Near South Side  South Loop
15  Central                 N/A                         33      Near South Side  Prairie Avenue Historic District
16  North Side              North Side                  05      North Center     Horner Park
17  North Side              North Side                  05      North Center     Roscoe Village
18  North Side              North Side                  06      Lake View        Boystown
19  North Side              North Side                  06      Lake View        Lake View East
20  North Side              North Side                  06      Lake View        Graceland West
21  North Side              North Side                  06      Lake View        South East Ravenswood
22  North Side              North Side                  06      Lake View        Wrigleyville
23  North Side              North Side                  07      Lincoln Park     Old Town Triangle
24  North Side              North Side                  07      Lincoln Park     Park West
25  North Side              North Side                  07      Lincoln Park     Ranch Triangle
26  North Side              North Side                  07      Lincoln Park     Sheffield Neighbors
27  North Side              North Side                  07      Lincoln Park     Wrightwood Neighbors
28  North Side              North Side                  21      Avondale         Belmont Gardens
29  North Side              North Side                  21      Avondale         Chicago's Polish Village
30  North Side              North Side                  21      Avondale         Kosciuszko Park
31  North Side              North Side                  22      Logan Square     Belmont Gardens
32  North Side              North Side                  22      Logan Square     Bucktown
33  North Side              North Side                  22      Logan Square     Kosciuszko Park
34  North Side              North Side                  22      Logan Square     Palmer Square
35  North Side              Far North side              01      Rogers Park      East Rogers Park
36  North Side              Far North side              02      West Ridge       Arcadia Terrace
37  North Side              Far North side              02      West Ridge       Peterson Park
38  North Side              Far North side              02      West Ridge       West Rogers Park
39  North Side              Far North side              03      Uptown           Buena Park
40  North Side              Far North side              03      Uptown           Argyle Street
41  North Side              Far North side              03      Uptown           Margate Park
42  North Side              Far North side              03      Uptown           Sheridan Park
43  North Side              Far North side              04      Lincoln Square   Ravenswood
44  North Side              Far North side              04      Lincoln Square   Ravenswood Gardens
...
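On the smaller question about the range: range(0, 8) stops before 8, so it yields only indices 0 through 7 and the ninth table is skipped. If you stay with read_html(), a rough sketch like the one below may help. The two small frames and the `sides` labels are hypothetical stand-ins for the list that pd.read_html(url, header=0) returns. The real page may not have exactly one table per side, so the labels would have to be checked by hand.

```python
import pandas as pd

# Hypothetical stand-ins for the list returned by pd.read_html(url, header=0)
df_list = [
    pd.DataFrame({'Number': ['08', '32'], 'Community area': ['Near North Side', 'Loop']}),
    pd.DataFrame({'Number': ['05'], 'Community area': ['North Center']}),
]
# Hypothetical labels, one per table, in page order
sides = ['Central', 'North Side']

# range(len(df_list)) covers every table, whatever the count turns out to be
for i in range(len(df_list)):
    print(i, sides[i], len(df_list[i]))

# keys= adds an outer index level recording which table each row came from
df = pd.concat(df_list, keys=sides, names=['Side', 'Row'])
# Move that level into an ordinary column
df = df.reset_index(level='Side')
print(df)
```

Reading the page once and indexing into the resulting list also avoids downloading it again on every pass of the loop.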