Splitting a table into multiple DataFrames

Date: 2020-06-22 08:39:07

Tags: python pandas beautifulsoup screen-scraping

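I am currently scraping this table with BeautifulSoup and want to split it into multiple DataFrames, starting a new one every time a green header element appears.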

This is the web page: http://www.greyhound-data.com/d?page=stadia&st=1011&land=au&stadiummode=3

This is what I have so far. I can't figure out how to split it; I'm used to pages where each block is simply a separate table.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.greyhound-data.com/d?page=stadia&st=1011&land=au&stadiummode=3"
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')

# take the last table on the page that has id="green"
table = soup.find_all("table", attrs={'id': "green"})
table = table[-1]

df = pd.read_html(str(table))[0]

output:

               Year quarter  ...                  Set on
    Distance: 331 m / 362 y  ... Distance: 331 m / 362 y
0                  2020 2nd  ...             15 JUN 2020
1                  2020 1st  ...             23 JAN 2020
2                  2019 4th  ...              6 OCT 2019
3                  2019 3rd  ...              1 SEP 2019
4                  2019 2nd  ...             28 APR 2019
..                      ...  ...                     ...
319                2002 3rd  ...              5 SEP 2002
320                2002 2nd  ...              6 JUN 2002
321                2001 4th  ...             18 OCT 2001
322                2001 3rd  ...             16 AUG 2001
323                2001 2nd  ...             14 JUN 2001

[324 rows x 7 columns]

1 answer:

Answer 0 (score: 1)

This script splits the table into several DataFrames:

import requests
from bs4 import BeautifulSoup
import pandas as pd


url = "http://www.greyhound-data.com/d?page=stadia&st=1011&land=au&stadiummode=3"
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')

table = soup.find_all("table", attrs={'id': "green"})[-1]

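trs, dfs, all_data = table.select('tr'), [], []
header = [th.get_text(strip=True) for th in trs[0].select('th')]

# trs[0] holds the column names and trs[1] the first distance header, so start at trs[2]
for tr in trs[2:]:
    if tr.td:
        # data row: collect the cell texts
        all_data.append([td.get_text(strip=True) for td in tr.select('td')])
    else:
        # distance header row: the rows collected so far form one DataFrame
        dfs.append(pd.DataFrame(all_data, columns=header))
        all_data = []
# the last block is not followed by a header row, so append it here
dfs.append(pd.DataFrame(all_data, columns=header))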

# print all DataFrames in list:
for df in dfs:
    print(df)
    print('-' * 160)

Prints:

   Year quarter running dif.dogs average time avg win time best time        Set by       Set on
0      2020 2nd              226        19.63        19.18     18.79     Data Base  15 JUN 2020
1      2020 1st              255        19.68        19.14     18.58     Wazza Who  23 JAN 2020
..          ...              ...          ...          ...       ...           ...          ...
39     2010 3rd              286        19.85        19.34     18.90  Royal Surfer  15 SEP 2010
40     2010 2nd               92        20.01        19.57     19.28      Paw Form  16 JUN 2010

[41 rows x 7 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
   Year quarter running dif.dogs average time avg win time best time           Set by       Set on
0      2020 2nd              217        23.40        22.79     22.25     Canya Cruise   3 JUN 2020
1      2020 1st              285        23.35        22.85     22.47     Dawn's Dream  22 JAN 2020
..          ...              ...          ...          ...       ...              ...          ...
65     2004 1st                3        23.54        23.25     23.25    Seismic Shock   9 JAN 2004
66     2003 4th               16        23.67        23.33     23.29  Far Away Places  17 OCT 2003

[67 rows x 7 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
   Year quarter running dif.dogs average time avg win time best time     Set by       Set on
0      2020 2nd              264        30.68        30.13     29.56  Oh Mickey  23 APR 2020
1      2020 1st              224        30.70        30.12     29.41  Sennachie  10 JAN 2020
..          ...              ...          ...          ...       ...        ...          ...
76     2001 2nd               13        30.50        30.37     30.16      Korda  27 APR 2001
77     2001 1st                3        30.72        30.72     30.55   Fly Fast   0 MAR 2001

[78 rows x 7 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
   Year quarter running dif.dogs average time avg win time best time            Set by       Set on
0      2020 2nd               76        35.71        35.14     34.65  Frieda Las Vegas  28 MAY 2020
1      2020 1st               76        35.77        35.21     34.72  Velocity Bettina  23 JAN 2020
..          ...              ...          ...          ...       ...               ...          ...
73     2001 2nd                1        35.49        35.49     35.49     Kissin Bobbie  24 MAY 2001
74     2001 1st                1        36.10        36.10     36.10    Brampton Blues  23 MAR 2001

[75 rows x 7 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
   Year quarter running dif.dogs average time avg win time best time             Set by       Set on
0      2020 2nd               33        42.73        42.08     41.62            Rasheda  28 MAY 2020
1      2020 1st               16        42.38        41.93     41.83      What About It  20 FEB 2020
..          ...              ...          ...          ...       ...                ...          ...
57     2001 3rd                2        42.57        42.53     42.53  Universal Tears *  16 AUG 2001
58     2001 2nd                4        42.24        42.27     42.15    Hotshow Vintage  14 JUN 2001

[59 rows x 7 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
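To try the splitting logic without fetching the page, here is a minimal sketch that runs the same loop over a small inline table. The markup in `SAMPLE_HTML` and the helper name `split_on_header_rows` are assumptions made for illustration (the real page's HTML may differ in detail); the cell values are taken from the output above. It uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the page's table: one row of column names,
# then a <th>-only row in front of each block of data rows.
SAMPLE_HTML = """
<table id="green">
  <tr><th>Year quarter</th><th>best time</th></tr>
  <tr><th colspan="2">Distance: 331 m / 362 y</th></tr>
  <tr><td>2020 2nd</td><td>18.79</td></tr>
  <tr><td>2020 1st</td><td>18.58</td></tr>
  <tr><th colspan="2">Distance: 395 m / 432 y</th></tr>
  <tr><td>2020 2nd</td><td>22.25</td></tr>
</table>
"""


def split_on_header_rows(table):
    """Return one DataFrame per block of <td> rows, split at <th>-only rows."""
    trs = table.select("tr")
    header = [th.get_text(strip=True) for th in trs[0].select("th")]
    dfs, block = [], []
    for tr in trs[1:]:
        if tr.td:  # data row: collect its cell texts
            block.append([td.get_text(strip=True) for td in tr.select("td")])
        elif block:  # header row: close the block collected so far
            dfs.append(pd.DataFrame(block, columns=header))
            block = []
    if block:  # the last block has no header row after it
        dfs.append(pd.DataFrame(block, columns=header))
    return dfs


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
dfs = split_on_header_rows(soup.find("table", attrs={"id": "green"}))
for df in dfs:
    print(df)
```

Checking `elif block` instead of a bare `else` lets the loop start at `trs[1:]` without producing an empty DataFrame for the first header row.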

Edit: to also get the Distance column:

import requests
from bs4 import BeautifulSoup
import pandas as pd


url = "http://www.greyhound-data.com/d?page=stadia&st=1011&land=au&stadiummode=3"
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')

table = soup.find_all("table", attrs={'id': "green"})[-1]

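trs, dfs, all_data, distance = table.select('tr'), [], [], ''
header = ['Distance'] + [th.get_text(strip=True) for th in trs[0].select('th')]

for tr in trs[1:]:
    if tr.td:
        # data row: prefix the cells with the current distance label
        all_data.append([distance] + [td.get_text(strip=True) for td in tr.select('td')])
    else:
        # header row: remember the new distance and close the previous block
        distance = tr.th.get_text(strip=True)
        if all_data:
            dfs.append(pd.DataFrame(all_data, columns=header))
            all_data = []

# the last block is not followed by a header row, so append it here
dfs.append(pd.DataFrame(all_data, columns=header))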

# print all DataFrames in list:
for df in dfs:
    print(df)
    print('-' * 160)

Prints:

                   Distance Year quarter running dif.dogs average time avg win time best time        Set by       Set on
0   Distance: 331 m / 362 y     2020 2nd              226        19.63        19.18     18.79     Data Base  15 JUN 2020
1   Distance: 331 m / 362 y     2020 1st              255        19.68        19.14     18.58     Wazza Who  23 JAN 2020
..                      ...          ...              ...          ...          ...       ...           ...          ...
39  Distance: 331 m / 362 y     2010 3rd              286        19.85        19.34     18.90  Royal Surfer  15 SEP 2010
40  Distance: 331 m / 362 y     2010 2nd               92        20.01        19.57     19.28      Paw Form  16 JUN 2010

[41 rows x 8 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
                   Distance Year quarter running dif.dogs average time avg win time best time           Set by       Set on
0   Distance: 395 m / 432 y     2020 2nd              217        23.40        22.79     22.25     Canya Cruise   3 JUN 2020
1   Distance: 395 m / 432 y     2020 1st              285        23.35        22.85     22.47     Dawn's Dream  22 JAN 2020
..                      ...          ...              ...          ...          ...       ...              ...          ...
65  Distance: 395 m / 432 y     2004 1st                3        23.54        23.25     23.25    Seismic Shock   9 JAN 2004
66  Distance: 395 m / 432 y     2003 4th               16        23.67        23.33     23.29  Far Away Places  17 OCT 2003

[67 rows x 8 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
                   Distance Year quarter running dif.dogs average time avg win time best time     Set by       Set on
0   Distance: 520 m / 569 y     2020 2nd              264        30.68        30.13     29.56  Oh Mickey  23 APR 2020
1   Distance: 520 m / 569 y     2020 1st              224        30.70        30.12     29.41  Sennachie  10 JAN 2020
..                      ...          ...              ...          ...          ...       ...        ...          ...
76  Distance: 520 m / 569 y     2001 2nd               13        30.50        30.37     30.16      Korda  27 APR 2001
77  Distance: 520 m / 569 y     2001 1st                3        30.72        30.72     30.55   Fly Fast   0 MAR 2001

[78 rows x 8 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
                   Distance Year quarter running dif.dogs average time avg win time best time            Set by       Set on
0   Distance: 600 m / 656 y     2020 2nd               76        35.71        35.14     34.65  Frieda Las Vegas  28 MAY 2020
1   Distance: 600 m / 656 y     2020 1st               76        35.77        35.21     34.72  Velocity Bettina  23 JAN 2020
..                      ...          ...              ...          ...          ...       ...               ...          ...
73  Distance: 600 m / 656 y     2001 2nd                1        35.49        35.49     35.49     Kissin Bobbie  24 MAY 2001
74  Distance: 600 m / 656 y     2001 1st                1        36.10        36.10     36.10    Brampton Blues  23 MAR 2001

[75 rows x 8 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
                   Distance Year quarter running dif.dogs average time avg win time best time             Set by       Set on
0   Distance: 710 m / 776 y     2020 2nd               33        42.73        42.08     41.62            Rasheda  28 MAY 2020
1   Distance: 710 m / 776 y     2020 1st               16        42.38        41.93     41.83      What About It  20 FEB 2020
..                      ...          ...              ...          ...          ...       ...                ...          ...
57  Distance: 710 m / 776 y     2001 3rd                2        42.57        42.53     42.53  Universal Tears *  16 AUG 2001
58  Distance: 710 m / 776 y     2001 2nd                4        42.24        42.27     42.15    Hotshow Vintage  14 JUN 2001

[59 rows x 8 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
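Once every row carries its Distance label, the pieces can also be kept in a single DataFrame and split again on demand. The following is a rough sketch rather than part of the original answer: the two small frames stand in for the scraped `dfs` list (values copied from the output above), and the names `combined`, `distance_m` and `by_distance` are made up for illustration.

```python
import pandas as pd

# Two tiny stand-ins for the scraped DataFrames in `dfs`.
dfs = [
    pd.DataFrame({"Distance": ["Distance: 331 m / 362 y"] * 2,
                  "Year quarter": ["2020 2nd", "2020 1st"],
                  "best time": ["18.79", "18.58"]}),
    pd.DataFrame({"Distance": ["Distance: 395 m / 432 y"] * 2,
                  "Year quarter": ["2020 2nd", "2020 1st"],
                  "best time": ["22.25", "22.47"]}),
]

# Stack the pieces; the Distance column keeps the blocks distinguishable.
combined = pd.concat(dfs, ignore_index=True)

# Pull the metre value out of strings like "Distance: 331 m / 362 y".
combined["distance_m"] = (
    combined["Distance"].str.extract(r"(\d+)\s*m", expand=False).astype(int)
)

# Scraped cells are strings, so convert the times before doing arithmetic.
combined["best time"] = pd.to_numeric(combined["best time"])

# One DataFrame per distance, keyed by the distance in metres.
by_distance = {
    int(metres): group.reset_index(drop=True)
    for metres, group in combined.groupby("distance_m", sort=False)
}
print(by_distance[395])
```

A dictionary keyed by distance can be easier to work with than a positional list, since each DataFrame is looked up by what it contains rather than by its order on the page.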