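How do I scrape multiple tables from one page in Python and index them?

Time: 2019-05-23 12:14:41

Tags: python python-3.x pandas loops web-scraping

I'm trying to match area numbers to Chicago's community areas using this Wikipedia page: https://en.wikipedia.org/wiki/Community_areas_in_Chicago

I know how to do this one table at a time, but I'm sure a loop would make the task easier.

However, the side names are not included in the tables, so I may have to match them more manually, with a join or a dictionary.

The code below works, but it scrapes all the tables into a single table, so I can't tell the "sides" apart.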

import pandas as pd

df_list = []
for i in range(0, 9):
    url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'
    df_list.append(pd.read_html(url, header=0)[i])

df = pd.concat(df_list).drop_duplicates()
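  1. Main task: I want to scrape all the tables with an additional index column that is unique to each table (the side name would be perfect). Can this be done with pandas?

  2. A minor issue: there are 9 districts, but when I use the (0:8) range the last table is missing, and I don't know why. Is there a way to set this range automatically, using something like len?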

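1 answer:

Answer 0: (score: 0)

The thing about read_html() is that it's great when you need to parse <table> tags, but anything outside the <table> tags won't be captured. The side names sit in headings above each table, so you need BeautifulSoup to be more specific about how you grab the data.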
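
To make the idea concrete before running it against the live page, here is a minimal offline sketch. The HTML string and the helper name headings_for_tables are made up for illustration; they only mimic the heading-then-table layout and are not taken from the Wikipedia page itself. It shows that the nearest preceding <h3> and an optional <caption> can be reached from each table:

```python
from bs4 import BeautifulSoup

# Made-up HTML shaped like the page: a heading, then a table (assumption)
SAMPLE_HTML = """
<h3>Central[edit]</h3>
<table>
  <tr><th>Number</th><th>Community area</th></tr>
  <tr><td>08</td><td>Near North Side</td></tr>
</table>
<h3>North Side[edit]</h3>
<table>
  <caption>Far North side</caption>
  <tr><th>Number</th><th>Community area</th></tr>
  <tr><td>01</td><td>Rogers Park</td></tr>
</table>
"""


def headings_for_tables(html):
    """Return (heading, caption) per table; either value is None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for table in soup.find_all("table"):
        # Nearest <h3> that appears before this table in the document
        heading = table.find_previous("h3")
        caption = table.find("caption")
        pairs.append((
            heading.get_text().split("[")[0].strip() if heading else None,
            caption.get_text().strip() if caption else None,
        ))
    return pairs


print(headings_for_tables(SAMPLE_HTML))
```

The code that follows applies the same lookup to the real page and also splits out the neighborhoods.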

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

tables = soup.find_all('table')
records = []
for table in tables:
    # The side name is in the nearest <h3> heading before the table
    heading = table.find_previous('h3')
    if heading is None:
        # No heading above this table, so there is no side name to attach
        continue
    main_area = heading.text.split('[')[0].strip()

    # Some tables carry a <caption> naming the sub-area
    caption = table.find('caption')
    sub_area = caption.text.strip() if caption else 'N/A'

    for row in table.find_all('tr'):
        data = row.find_all('td')
        if len(data) < 3:
            # Header rows and short rows have no data cells to read
            continue

        number = data[0].text.strip()
        com_area = data[1].text.strip()

        # One output row per neighborhood; keep the area even if it lists none
        n_list = [each.text.strip() for each in data[2].find_all('li')] or ['']
        for each in n_list:
            records.append([main_area, sub_area, number, com_area, each])

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
results_df = pd.DataFrame(records, columns=['Community area by side', 'Sub community area by side', 'Number', 'Community area', 'Neighborhoods'])

Output:

print (results_df.to_string())
    Community area by side Sub community area by side Number          Community area                     Neighborhoods
0                  Central                        N/A     08         Near North Side                     Cabrini–Green
1                  Central                        N/A     08         Near North Side                    The Gold Coast
2                  Central                        N/A     08         Near North Side                      Goose Island
3                  Central                        N/A     08         Near North Side                  Magnificent Mile
4                  Central                        N/A     08         Near North Side                          Old Town
5                  Central                        N/A     08         Near North Side                       River North
6                  Central                        N/A     08         Near North Side                        River West
7                  Central                        N/A     08         Near North Side                     Streeterville
8                  Central                        N/A     32                    Loop                              Loop
9                  Central                        N/A     32                    Loop                      New Eastside
10                 Central                        N/A     32                    Loop                        South Loop
11                 Central                        N/A     32                    Loop                    West Loop Gate
12                 Central                        N/A     33         Near South Side                     Dearborn Park
13                 Central                        N/A     33         Near South Side                     Printer's Row
14                 Central                        N/A     33         Near South Side                        South Loop
15                 Central                        N/A     33         Near South Side  Prairie Avenue Historic District
16              North Side                 North Side     05            North Center                       Horner Park
17              North Side                 North Side     05            North Center                    Roscoe Village
18              North Side                 North Side     06               Lake View                          Boystown
19              North Side                 North Side     06               Lake View                    Lake View East
20              North Side                 North Side     06               Lake View                    Graceland West
21              North Side                 North Side     06               Lake View             South East Ravenswood
22              North Side                 North Side     06               Lake View                      Wrigleyville
23              North Side                 North Side     07            Lincoln Park                 Old Town Triangle
24              North Side                 North Side     07            Lincoln Park                         Park West
25              North Side                 North Side     07            Lincoln Park                    Ranch Triangle
26              North Side                 North Side     07            Lincoln Park               Sheffield Neighbors
27              North Side                 North Side     07            Lincoln Park              Wrightwood Neighbors
28              North Side                 North Side     21                Avondale                   Belmont Gardens
29              North Side                 North Side     21                Avondale          Chicago's Polish Village
30              North Side                 North Side     21                Avondale                   Kosciuszko Park
31              North Side                 North Side     22            Logan Square                   Belmont Gardens
32              North Side                 North Side     22            Logan Square                          Bucktown
33              North Side                 North Side     22            Logan Square                   Kosciuszko Park
34              North Side                 North Side     22            Logan Square                     Palmer Square
35              North Side             Far North side     01             Rogers Park                  East Rogers Park
36              North Side             Far North side     02              West Ridge                   Arcadia Terrace
37              North Side             Far North side     02              West Ridge                     Peterson Park
38              North Side             Far North side     02              West Ridge                  West Rogers Park
39              North Side             Far North side     03                  Uptown                        Buena Park
40              North Side             Far North side     03                  Uptown                     Argyle Street
41              North Side             Far North side     03                  Uptown                      Margate Park
42              North Side             Far North side     03                  Uptown                     Sheridan Park
43              North Side             Far North side     04          Lincoln Square                        Ravenswood
44              North Side             Far North side     04          Lincoln Square                Ravenswood Gardens 
...
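
On the second point in the question: range(0, 8) stops before 8, so it yields indices 0 through 7 and the ninth table (index 8) is never read. Since pd.read_html() returns a plain list of DataFrames, it can be called once and the list iterated directly, so no hard-coded count is needed. Below is a rough sketch of tagging each table before stacking them. The helper name label_tables and the column name table_label are hypothetical, and the two small frames merely stand in for the list that read_html would return:

```python
import pandas as pd


def label_tables(tables, labels=None):
    """Stack DataFrames, tagging each row with the table it came from."""
    if labels is None:
        # Fall back to each table's position in the list
        labels = range(len(tables))
    labelled = [
        table.assign(table_label=label)
        for table, label in zip(tables, labels)
    ]
    return pd.concat(labelled, ignore_index=True)


if __name__ == "__main__":
    # Stand-ins for the list returned by pd.read_html(url, header=0)
    central = pd.DataFrame({"Number": ["08", "32"],
                            "Community area": ["Near North Side", "Loop"]})
    north = pd.DataFrame({"Number": ["05"],
                          "Community area": ["North Center"]})
    print(label_tables([central, north], labels=["Central", "North Side"]))
```

With no labels passed, each row is tagged with its table's position, which is enough to tell the tables apart. The side names themselves still have to come from somewhere else, such as the headings collected with BeautifulSoup.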