我想抓取this website的网球比赛结果
我想要的结果表具有以下列:tournament_name match_time player_1 player_2 player_1_score player_2_score
这是一个例子
tournament_name match_time player_1 player_2 p1_set1 p2_set1
Roma / Italy 11:00 Krajinovic Filip Auger Aliassime Felix 6 4
Iasi (IX) / Romania 10:00 Bourgue Mathias Martineau Matteo 6 1
我无法将id="main_tour"
上的每个锦标赛名称与每一行相关联(一行是2 class="match"
或2 class="match1"
我尝试了以下代码:
import requests
from bs4 import BeautifulSoup
u = "http://www.tennisprediction.com/?year=2020&month=9&day=14"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(u, timeout=30, headers=headers)
# print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
for table in soup.select('#main_tur'):
tourn_value = [i.get_text(strip=True) for i in table.select('tr:nth-child(1)')][0].split('/')[0].strip()
tourn_name = [i.get_text(strip=True) for i in table.select('tr td#main_tour')]
row = [i.get_text(strip=True) for i in table.select('.match')]
row2 = [i.get_text(strip=True) for i in table.select('.match1')]
print(tourn_value, tourn_name)
答案 0 :(得分:1)
您可以使用此脚本以您的格式将表保存为CSV:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://www.tennisprediction.com/?year=2020&month=9&day=14'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for t in soup.select('.main_time'):
p1 = t.find_next(class_='main_player')
p2 = p1.find_next(class_='main_player')
tour = t.find_previous(id='main_tour')
scores1 = {'player_1_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.select('.main_res')), 1)}
scores2 = {'player_2_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.find_next_sibling().select('.main_res')), 1)}
all_data.append({
'tournament_name': ' / '.join( a.text for a in tour.select('a') ),
'match_time': t.text,
'player_1': p1.get_text(strip=True, separator=' '),
'player_2': p2.get_text(strip=True, separator=' '),
})
all_data[-1].update(scores1)
all_data[-1].update(scores2)
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
保存data.csv
:
编辑:要为两个玩家添加奇数,概率栏:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://www.tennisprediction.com/?year=2020&month=9&day=14'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for t in soup.select('.main_time'):
p1 = t.find_next(class_='main_player')
p2 = p1.find_next(class_='main_player')
tour = t.find_previous(id='main_tour')
odd1 = t.find_next(class_='main_odds_m')
odd2 = t.parent.find_next_sibling().find_next(class_='main_odds_m')
prob1 = t.find_next(class_='main_perc')
prob2 = t.parent.find_next_sibling().find_next(class_='main_perc')
scores1 = {'player_1_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.select('.main_res')), 1)}
scores2 = {'player_2_set{}'.format(i): s for i, s in enumerate((tag.get_text(strip=True) for tag in t.parent.find_next_sibling().select('.main_res')), 1)}
all_data.append({
'tournament_name': ' / '.join( a.text for a in tour.select('a') ),
'match_time': t.text,
'player_1': p1.get_text(strip=True, separator=' '),
'player_2': p2.get_text(strip=True, separator=' '),
'odd1': odd1.text,
'prob1': prob1.text,
'odd2': odd2.text,
'prob2': prob2.text
})
all_data[-1].update(scores1)
all_data[-1].update(scores2)
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
答案 1 :(得分:1)
Andrej的解决方案确实非常优雅。接受他的解决方案,但这是我的努力:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://www.tennisprediction.com/?year=2020&month=9&day=14'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
rows=[]
for matchClass in ['match','match1']:
matches = soup.find_all('tr',{'class':'match'})
for idx, match in enumerate(matches):
if idx%2 != 0:
continue
row = {}
tourny = match.find_previous('td',{'id':'main_tour'}).text
time = match.find('td',{'class':'main_time'}).text
p1 = match.find('td',{'class':'main_player'})
player_1 = p1.text
row.update({'tournament_name':tourny,'match_time':time,'player_1':player_1})
sets = p1.find_previous('tr',{'class':'match'}).find_all('td',{'class':'main_res'})
for idx,each_set in enumerate(sets):
row.update({'p1_set%d'%(idx+1):each_set.text})
p2 = match.find_next('td',{'class':'main_player'})
player_2 = p2.text
row.update({'player_2':player_2})
sets = p2.find_next('tr',{'class':'match'}).find_all('td',{'class':'main_res'})
for idx,each_set in enumerate(sets):
row.update({'p2_set%d'%(idx+1):each_set.text})
rows.append(row)
df = pd.DataFrame(rows)
输出:
print(df.head(10).to_string())
tournament_name match_time player_1 p1_set1 p1_set2 p1_set3 p1_set4 p1_set5 player_2 p2_set1 p2_set2 p2_set3 p2_set4 p2_set5
0 Roma / Italy prize / money : 5791 000 USD 11:10 Krajinovic Filip (SRB) (26) 6 7 Krajinovic Filip (SRB) (26) 4 5
1 Roma / Italy prize / money : 5791 000 USD 13:15 Dimitrov Grigor (BGR) (20) 7 6 Dimitrov Grigor (BGR) (20) 5 1
2 Roma / Italy prize / money : 5791 000 USD 13:50 Coric Borna (HRV) (32) 6 6 Coric Borna (HRV) (32) 4 4
3 Roma / Italy prize / money : 5791 000 USD 15:30 Humbert Ugo (FRA) (42) 6 7 Humbert Ugo (FRA) (42) 3 6 (5)
4 Roma / Italy prize / money : 5791 000 USD 19:00 Nishikori Kei (JPN) (34) 6 7 Nishikori Kei (JPN) (34) 4 6 (3)
5 Roma / Italy prize / money : 5791 000 USD 22:00 Travaglia Stefano (ITA) (87) 6 7 Travaglia Stefano (ITA) (87) 4 6 (4)
6 Iasi (IX) / Romania prize / money : 100 000 USD 10:05 Menezes Joao (BRA) (189) 6 6 Menezes Joao (BRA) (189) 4 4
7 Iasi (IX) / Romania prize / money : 100 000 USD 12:05 Cretu Cezar (2001) (ROU) 2 6 6 Cretu Cezar (2001) (ROU) 6 3 4
8 Iasi (IX) / Romania prize / money : 100 000 USD 14:35 Zuk Kacper (POL) (306) 6 6 Zuk Kacper (POL) (306) 2 0
9 Roma / Italy prize / money : 3452 000 USD 11:05 Pavlyuchenkova Anastasia (RUS) (32) 6 6 6 Pavlyuchenkova Anastasia (RUS) (32) 4 7 (5) 1