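I currently get this table with BeautifulSoup and want to split it into multiple DataFrames, starting a new one every time a green header row appears.
This is the web page: http://www.greyhound-data.com/d?page=stadia&st=1011&land=au&stadiummode=3
This is what I have so far. I can't figure out how to split it, because in the problems I'm used to, each table is already a separate element.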
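import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "http://www.greyhound-data.com/d?page=stadia&st=1011&land=au&stadiummode=3"
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')
# Several tables share id="green"; the last one holds the statistics.
table = soup.find_all("table", attrs={'id': "green"})
table = table[-1]
# read_html parses the whole table into a single DataFrame.
df = pd.read_html(str(table))[0]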
Output:
Year quarter ... Set on
Distance: 331 m / 362 y ... Distance: 331 m / 362 y
0 2020 2nd ... 15 JUN 2020
1 2020 1st ... 23 JAN 2020
2 2019 4th ... 6 OCT 2019
3 2019 3rd ... 1 SEP 2019
4 2019 2nd ... 28 APR 2019
.. ... ... ...
319 2002 3rd ... 5 SEP 2002
320 2002 2nd ... 6 JUN 2002
321 2001 4th ... 18 OCT 2001
322 2001 3rd ... 16 AUG 2001
323 2001 2nd ... 14 JUN 2001
[324 rows x 7 columns]
Answer 0 (score: 1)
This script splits the table into several DataFrames:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "http://www.greyhound-data.com/d?page=stadia&st=1011&land=au&stadiummode=3"
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')
table = soup.find_all("table", attrs={'id': "green"})[-1]
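# trs[0] holds the column names; trs[1] is expected to be the first green distance header.
trs, dfs, all_data = table.select('tr'), [], []
header = [th.get_text(strip=True) for th in trs[0].select('th')]
for tr in trs[2:]:
    if tr.td:
        # Data row: collect the cell texts.
        all_data.append([td.get_text(strip=True) for td in tr.select('td')])
    else:
        # Green header row: close the current group and start a new one.
        dfs.append(pd.DataFrame(all_data, columns=header))
        all_data = []
# Append the last group, which has no header row after it.
dfs.append(pd.DataFrame(all_data, columns=header))
# print all DataFrames in list:
for df in dfs:
    print(df)
    print('-' * 160)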
Prints:
Year quarter running dif.dogs average time avg win time best time Set by Set on
0 2020 2nd 226 19.63 19.18 18.79 Data Base 15 JUN 2020
1 2020 1st 255 19.68 19.14 18.58 Wazza Who 23 JAN 2020
.. ... ... ... ... ... ... ...
39 2010 3rd 286 19.85 19.34 18.90 Royal Surfer 15 SEP 2010
40 2010 2nd 92 20.01 19.57 19.28 Paw Form 16 JUN 2010
[41 rows x 7 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Year quarter running dif.dogs average time avg win time best time Set by Set on
0 2020 2nd 217 23.40 22.79 22.25 Canya Cruise 3 JUN 2020
1 2020 1st 285 23.35 22.85 22.47 Dawn's Dream 22 JAN 2020
.. ... ... ... ... ... ... ...
65 2004 1st 3 23.54 23.25 23.25 Seismic Shock 9 JAN 2004
66 2003 4th 16 23.67 23.33 23.29 Far Away Places 17 OCT 2003
[67 rows x 7 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Year quarter running dif.dogs average time avg win time best time Set by Set on
0 2020 2nd 264 30.68 30.13 29.56 Oh Mickey 23 APR 2020
1 2020 1st 224 30.70 30.12 29.41 Sennachie 10 JAN 2020
.. ... ... ... ... ... ... ...
76 2001 2nd 13 30.50 30.37 30.16 Korda 27 APR 2001
77 2001 1st 3 30.72 30.72 30.55 Fly Fast 0 MAR 2001
[78 rows x 7 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Year quarter running dif.dogs average time avg win time best time Set by Set on
0 2020 2nd 76 35.71 35.14 34.65 Frieda Las Vegas 28 MAY 2020
1 2020 1st 76 35.77 35.21 34.72 Velocity Bettina 23 JAN 2020
.. ... ... ... ... ... ... ...
73 2001 2nd 1 35.49 35.49 35.49 Kissin Bobbie 24 MAY 2001
74 2001 1st 1 36.10 36.10 36.10 Brampton Blues 23 MAR 2001
[75 rows x 7 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Year quarter running dif.dogs average time avg win time best time Set by Set on
0 2020 2nd 33 42.73 42.08 41.62 Rasheda 28 MAY 2020
1 2020 1st 16 42.38 41.93 41.83 What About It 20 FEB 2020
.. ... ... ... ... ... ... ...
57 2001 3rd 2 42.57 42.53 42.53 Universal Tears * 16 AUG 2001
58 2001 2nd 4 42.24 42.27 42.15 Hotshow Vintage 14 JUN 2001
[59 rows x 7 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
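For anyone who wants to try the splitting logic without requesting the page, here is a minimal sketch of the same idea on plain Python lists. The `(is_header, cells)` row representation and the helper name `split_on_headers` are assumptions made for illustration, not part of the script above; the sample rows are copied from the output shown in this thread.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical row representation: (is_header, cells).
Row = Tuple[bool, List[str]]


def split_on_headers(rows: Iterable[Row]) -> Dict[str, List[List[str]]]:
    """Group data rows under the most recent header row."""
    groups: Dict[str, List[List[str]]] = {}
    current = None
    for is_header, cells in rows:
        if is_header:
            # A header row starts a new group keyed by its text.
            current = cells[0]
            groups.setdefault(current, [])
        elif current is not None:
            groups[current].append(cells)
        # Data rows that appear before any header are skipped.
    return groups


rows = [
    (True, ["Distance: 331 m / 362 y"]),
    (False, ["2020 2nd", "226", "19.63", "19.18", "18.79", "Data Base", "15 JUN 2020"]),
    (False, ["2020 1st", "255", "19.68", "19.14", "18.58", "Wazza Who", "23 JAN 2020"]),
    (True, ["Distance: 395 m / 432 y"]),
    (False, ["2020 2nd", "217", "23.40", "22.79", "22.25", "Canya Cruise", "3 JUN 2020"]),
]

groups = split_on_headers(rows)
for distance, data in groups.items():
    print(distance, len(data))
```

Each list in `groups` could then be passed to `pd.DataFrame(data, columns=header)` in the same way the script above does. Keying by header text has the side effect that two header rows in a row never produce an empty DataFrame.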
Edit: to also get the distance column:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "http://www.greyhound-data.com/d?page=stadia&st=1011&land=au&stadiummode=3"
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')
table = soup.find_all("table", attrs={'id': "green"})[-1]
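# `distance` holds the text of the most recent green header row.
trs, dfs, all_data, distance = table.select('tr'), [], [], ''
header = ['Distance'] + [th.get_text(strip=True) for th in trs[0].select('th')]
for tr in trs[1:]:
    if tr.td:
        # Data row: prefix the cells with the current distance.
        all_data.append([distance] + [td.get_text(strip=True) for td in tr.select('td')])
    else:
        # Green header row: remember its text and close the previous group, if any.
        distance = tr.th.get_text(strip=True)
        if all_data:
            dfs.append(pd.DataFrame(all_data, columns=header))
            all_data = []
# Append the last group, which has no header row after it.
dfs.append(pd.DataFrame(all_data, columns=header))
# print all DataFrames in list:
for df in dfs:
    print(df)
    print('-' * 160)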
Prints:
Distance Year quarter running dif.dogs average time avg win time best time Set by Set on
0 Distance: 331 m / 362 y 2020 2nd 226 19.63 19.18 18.79 Data Base 15 JUN 2020
1 Distance: 331 m / 362 y 2020 1st 255 19.68 19.14 18.58 Wazza Who 23 JAN 2020
.. ... ... ... ... ... ... ... ...
39 Distance: 331 m / 362 y 2010 3rd 286 19.85 19.34 18.90 Royal Surfer 15 SEP 2010
40 Distance: 331 m / 362 y 2010 2nd 92 20.01 19.57 19.28 Paw Form 16 JUN 2010
[41 rows x 8 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Distance Year quarter running dif.dogs average time avg win time best time Set by Set on
0 Distance: 395 m / 432 y 2020 2nd 217 23.40 22.79 22.25 Canya Cruise 3 JUN 2020
1 Distance: 395 m / 432 y 2020 1st 285 23.35 22.85 22.47 Dawn's Dream 22 JAN 2020
.. ... ... ... ... ... ... ... ...
65 Distance: 395 m / 432 y 2004 1st 3 23.54 23.25 23.25 Seismic Shock 9 JAN 2004
66 Distance: 395 m / 432 y 2003 4th 16 23.67 23.33 23.29 Far Away Places 17 OCT 2003
[67 rows x 8 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Distance Year quarter running dif.dogs average time avg win time best time Set by Set on
0 Distance: 520 m / 569 y 2020 2nd 264 30.68 30.13 29.56 Oh Mickey 23 APR 2020
1 Distance: 520 m / 569 y 2020 1st 224 30.70 30.12 29.41 Sennachie 10 JAN 2020
.. ... ... ... ... ... ... ... ...
76 Distance: 520 m / 569 y 2001 2nd 13 30.50 30.37 30.16 Korda 27 APR 2001
77 Distance: 520 m / 569 y 2001 1st 3 30.72 30.72 30.55 Fly Fast 0 MAR 2001
[78 rows x 8 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Distance Year quarter running dif.dogs average time avg win time best time Set by Set on
0 Distance: 600 m / 656 y 2020 2nd 76 35.71 35.14 34.65 Frieda Las Vegas 28 MAY 2020
1 Distance: 600 m / 656 y 2020 1st 76 35.77 35.21 34.72 Velocity Bettina 23 JAN 2020
.. ... ... ... ... ... ... ... ...
73 Distance: 600 m / 656 y 2001 2nd 1 35.49 35.49 35.49 Kissin Bobbie 24 MAY 2001
74 Distance: 600 m / 656 y 2001 1st 1 36.10 36.10 36.10 Brampton Blues 23 MAR 2001
[75 rows x 8 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Distance Year quarter running dif.dogs average time avg win time best time Set by Set on
0 Distance: 710 m / 776 y 2020 2nd 33 42.73 42.08 41.62 Rasheda 28 MAY 2020
1 Distance: 710 m / 776 y 2020 1st 16 42.38 41.93 41.83 What About It 20 FEB 2020
.. ... ... ... ... ... ... ... ...
57 Distance: 710 m / 776 y 2001 3rd 2 42.57 42.53 42.53 Universal Tears * 16 AUG 2001
58 Distance: 710 m / 776 y 2001 2nd 4 42.24 42.27 42.15 Hotshow Vintage 14 JUN 2001
[59 rows x 8 columns]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
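The Distance column above holds strings such as `Distance: 331 m / 362 y`. If numeric values are more convenient, a regular expression can pull out the metres and yards. This is only a sketch: the helper name `parse_distance` is hypothetical, and the pattern assumes every header follows the format seen in the output above.

```python
import re
from typing import Optional, Tuple

# Assumes the header format seen in the output above: "Distance: <m> m / <y> y".
DISTANCE_RE = re.compile(r"Distance:\s*(\d+)\s*m\s*/\s*(\d+)\s*y")


def parse_distance(label: str) -> Optional[Tuple[int, int]]:
    """Return (metres, yards) from a distance header, or None if it does not match."""
    match = DISTANCE_RE.search(label)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_distance("Distance: 331 m / 362 y"))  # (331, 362)
print(parse_distance("Year quarter"))             # None
```

With the DataFrames built above, something like `df['Distance'].map(parse_distance)` should apply this to every row, though I have not run it against the live page.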