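I am trying to scrape some data from this website. My goal is to collect the last 10 results (win/loss/draw) for any team; I am only using this particular team as an example. The source for a single row is: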
<tr class="odd match no-date-repetition" data-timestamp="1515864600" id="page_team_1_block_team_matches_3_match-2463021" data-competition="8">
<td class="day no-repetition">Sat</td>
<td class="full-date" nowrap="nowrap">13/01/18</td>
<td class="competition"><a href="/national/england/premier-league/20172018/regular-season/r41547/" title="Premier League">PRL</a></td>
<td class="team team-a ">
<a href="/teams/england/tottenham-hotspur-football-club/675/" title="Tottenham Hotspur">
Tottenham Hotspur
</a>
</td>
<td class="score-time score">
<a href="/matches/2018/01/13/england/premier-league/tottenham-hotspur-football-club/everton-football-club/2463021/" class="result-win">
4 - 0
</a>
</td>
<td class="team team-b ">
<a href="/teams/england/everton-football-club/674/" title="Everton">
Everton
</a>
</td>
<td class="events-button button first-occur">
<a href="/matches/2018/01/13/england/premier-league/tottenham-hotspur-football-club/everton-football-club/2463021/#events" title="View events" class="events-button-button ">View events</a>
</td>
<td class="info-button button">
<a href="/matches/2018/01/13/england/premier-league/tottenham-hotspur-football-club/everton-football-club/2463021/" title="More info">More info</a>
</td>
</tr>
As you can see, the result is stored in the <td class="score-time score"> cell.
My knowledge of Python and web scraping is very limited, so my current code is:
import bs4
import requests
# soccerwayURL is assumed to hold the URL of the team's matches page
res2 = requests.get(soccerwayURL)
soup2 = bs4.BeautifulSoup(res2.text, 'html.parser')
elems2 = soup2.select('#page_team_1_block_team_matches_3_match-2463021 > td.score-time.score')
print(elems2[0].text.strip())
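This prints '4 - 0', which is fine, but the problem arises when I try to access another row. The 7-digit number (2463021 in the example above) is unique to that row. This means that to get the score from a different row, I would have to find its unique 7-digit number and put it into the CSS selector '#page_team_1_block_team_matches_3_match-******* > td.score-time.score', where the asterisks stand for the unique number.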
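The online course I am taking only showed how to reference content through CSS selectors, so I am not sure how to retrieve the scores without manually writing a CSS selector for each row.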
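Inside the <td class="score-time score"> cell there is an <a> element with class="result-win". Ideally, I would like to extract "result-win", because I am not looking for the score of the match, only for whether the result was a win, a loss, or a draw.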
I hope this post is clear. My knowledge is limited, so I apologize if my vocabulary does not quite match the proper technical terms.
My objective statement is: "Retrieve the last 10 results (win, loss, draw) for any team on the Soccerway website."
Answer 0 (score: 0)
from bs4 import BeautifulSoup
import requests
import urllib3
# Had some security issues. Had to disable the warning. Be careful!
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# Need to disable SSL verification. Be careful!
r = requests.get('https://us.soccerway.com/teams/england/tottenham-hotspur-football-club/675/matches/', verify=False)
soup = BeautifulSoup(r.text, 'lxml')
matches = soup.find('table', {'class': 'matches'}).find('tbody')
i = 0
for row in matches.find_all('tr'):
    # Stop after the first ten rows
    if i == 10:
        break
    else:
        i += 1
    data = row.find_all('td')
    home_team = data[3].text.strip()
    match_result = data[4].text.strip()
    # The first class on the score link encodes the outcome, e.g. 'result-win'
    match_result_class = data[4].find('a').attrs['class'][0]
    away_team = data[5].text.strip()
    output = str.format('Home team : {0}, Away team : {1}, Match Result Class :{2}', home_team, away_team, match_result_class)
    print(output)
Output
Home team : Newcastle United, Away team : Tottenham Hotspur, Match Result Class :result-win
Home team : Tottenham Hotspur, Away team : Chelsea, Match Result Class :result-loss
Home team : Tottenham Hotspur, Away team : Burnley, Match Result Class :result-draw
Home team : Everton, Away team : Tottenham Hotspur, Match Result Class :result-win
Home team : Tottenham Hotspur, Away team : Borussia Dortmund, Match Result Class :result-win
Home team : Tottenham Hotspur, Away team : Swansea City, Match Result Class :result-draw
Home team : Tottenham Hotspur, Away team : Barnsley, Match Result Class :result-win
Home team : West Ham United, Away team : Tottenham Hotspur, Match Result Class :result-win
Home team : APOEL, Away team : Tottenham Hotspur, Match Result Class :result-win
Home team : Huddersfield Town, Away team : Tottenham Hotspur, Match Result Class :result-win
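Since the question asks specifically about CSS selectors, here is a rough sketch of the same idea using select(). Dropping the row id from the selector makes it match the score cell of every row, not just one. It runs against a small inline sample instead of the live page. SAMPLE_HTML and result_labels are names made up for this example, the second and third rows are invented, and the real page markup may differ from what the question shows.

```python
from bs4 import BeautifulSoup

# Trimmed sample modelled on the row in the question. The second and third
# rows are invented for illustration and do not come from the real page.
SAMPLE_HTML = """
<table class="matches"><tbody>
<tr id="page_team_1_block_team_matches_3_match-2463021">
  <td class="team team-a"><a>Tottenham Hotspur</a></td>
  <td class="score-time score"><a class="result-win">4 - 0</a></td>
  <td class="team team-b"><a>Everton</a></td>
</tr>
<tr id="page_team_1_block_team_matches_3_match-0000001">
  <td class="team team-a"><a>Team A</a></td>
  <td class="score-time score"><a class="result-draw">1 - 1</a></td>
  <td class="team team-b"><a>Team B</a></td>
</tr>
<tr id="page_team_1_block_team_matches_3_match-0000002">
  <td class="team team-a"><a>Team C</a></td>
  <td class="score-time score"><a>P - P</a></td>
  <td class="team team-b"><a>Team D</a></td>
</tr>
</tbody></table>
"""


def result_labels(html, limit=10):
    """Return up to `limit` outcomes ('win', 'loss', 'draw') in document order."""
    soup = BeautifulSoup(html, "html.parser")
    labels = []
    # No row id in the selector, so it matches the score link of every row.
    for link in soup.select("td.score-time.score > a"):
        # `class` is a multi-valued attribute, so bs4 returns it as a list.
        for cls in link.get("class") or []:
            if cls.startswith("result-"):
                labels.append(cls[len("result-"):])
                break
    return labels[:limit]


print(result_labels(SAMPLE_HTML))
```

To try it on the live site, pass the downloaded page text (for example r.text from the code above) instead of SAMPLE_HTML. Like the loop above, this takes the first rows in page order, so whether those are the most recent matches depends on how the page sorts its table.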