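I started scraping this site, https://www.flashscore.com/, with Selenium, but that is very slow because I have to scrape thousands of URLs, so I looked for a faster approach using requests: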
import requests
from bs4 import BeautifulSoup
import json
import re

url = 'https://www.flashscore.com/match/tE4RoHzB/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# the feed signature sits in the inline script that sets window.environment
scripts = soup.find_all('script', {'type': "text/javascript"})
for script in scripts:
    if 'window.environment =' in str(script):
        scriptStr = str(script)
        jsonMatch = re.compile("{.*}")
        jsonStr = jsonMatch.search(scriptStr)[0]
        jsonData = json.loads(jsonStr)
        fsign = jsonData['config']['app']['feed_sign']
        headers.update({'x-fsign': fsign})

# request the statistics feed with the x-fsign header set
url = "https://d.flashscore.com/x/feed/df_st_1_tE4RoHzB"
response = requests.get(url, headers=headers)
print(response.status_code)
print(response.text.strip())
Output:
SE÷Match¬~SG÷Ball Possession¬SH÷41%¬SI÷59%¬~SG÷Goal Attempts¬SH÷10¬SI÷20¬~SG÷Shots on Goal¬SH÷4¬SI÷3¬~SG÷Shots off Goal¬SH÷3¬SI÷9¬~SG÷Blocked Shots¬SH÷3¬SI÷8¬~SG÷Free Kicks¬SH÷8¬SI÷11¬~SG÷Corner Kicks¬SH÷6¬SI÷7¬~SG÷Offsides¬SH÷1¬SI÷1¬~SG÷Throw-in¬SH÷15¬SI÷15¬~SG÷Goalkeeper Saves¬SH÷0¬SI÷4¬~SG÷Fouls¬SH÷10¬SI÷7¬~SG÷Total Passes¬SH÷389¬SI÷584¬~SG÷Tackles¬SH÷14¬SI÷15¬~SG÷Attacks¬SH÷107¬SI÷105¬~SG÷Dangerous Attacks¬SH÷77¬SI÷53¬~SE÷1st Half¬~SG÷Ball Possession¬SH÷37%¬SI÷63%¬~SG÷Goal Attempts¬SH÷5¬SI÷13¬~SG÷Shots on Goal¬SH÷1¬SI÷1¬~SG÷Shots off Goal¬SH÷2¬SI÷7¬~SG÷Blocked Shots¬SH÷2¬SI÷5¬~SG÷Free Kicks¬SH÷2¬SI÷5¬~SG÷Corner Kicks¬SH÷2¬SI÷2¬~SG÷Offsides¬SH÷0¬SI÷0¬~SG÷Throw-in¬SH÷10¬SI÷9¬~SG÷Goalkeeper Saves¬SH÷0¬SI÷1¬~SG÷Fouls¬SH÷5¬SI÷2¬~SG÷Total Passes¬SH÷188¬SI÷331¬~SG÷Tackles¬SH÷8¬SI÷10¬~SG÷Attacks¬SH÷50¬SI÷61¬~SG÷Dangerous Attacks¬SH÷35¬SI÷29¬~SE÷2nd Half¬~SG÷Ball Possession¬SH÷45%¬SI÷55%¬~SG÷Goal Attempts¬SH÷5¬SI÷7¬~SG÷Shots on Goal¬SH÷3¬SI÷2¬~SG÷Shots off Goal¬SH÷1¬SI÷2¬~SG÷Blocked Shots¬SH÷1¬SI÷3¬~SG÷Free Kicks¬SH÷6¬SI÷6¬~SG÷Corner Kicks¬SH÷4¬SI÷5¬~SG÷Offsides¬SH÷1¬SI÷1¬~SG÷Throw-in¬SH÷5¬SI÷6¬~SG÷Goalkeeper Saves¬SH÷0¬SI÷3¬~SG÷Fouls¬SH÷5¬SI÷5¬~SG÷Total Passes¬SH÷201¬SI÷253¬~SG÷Tackles¬SH÷6¬SI÷5¬~SG÷Attacks¬SH÷57¬SI÷44¬~SG÷Dangerous Attacks¬SH÷42¬SI÷24¬~A1÷¬~
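With this code I can reach the URL that holds the statistics in a particular format. But how do I scrape the data out of that feed, and how do I get the other information, such as the teams, the score, and the date and time?
The page I want to scrape looks like this: https://www.flashscore.com/match/tE4RoHzB/#match-summary/match-summary
With a few changes to the pattern, the output would look like this: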
Match
Stat Ball Possession Home 41% Away 59%
Stat Goal Attempts Home 10 Away 20
Stat Shots on Goal Home 4 Away 3
Stat Shots off Goal Home 3 Away 9
Stat Blocked Shots Home 3 Away 8
Stat Free Kicks Home 8 Away 11
Stat Corner Kicks Home 6 Away 7
Stat Offsides Home 1 Away 1
Stat Throw-in Home 15 Away 15
Stat Goalkeeper Saves Home 0 Away 4
Stat Fouls Home 10 Away 7
Stat Total Passes Home 389 Away 584
Stat Tackles Home 14 Away 15
Stat Attacks Home 107 Away 105
Stat Dangerous Attacks Home 77 Away 53
1st Half
Stat Ball Possession Home 37% Away 63%
Stat Goal Attempts Home 5 Away 13
Stat Shots on Goal Home 1 Away 1
Stat Shots off Goal Home 2 Away 7
Stat Blocked Shots Home 2 Away 5
Stat Free Kicks Home 2 Away 5
Stat Corner Kicks Home 2 Away 2
Stat Offsides Home 0 Away 0
Stat Throw-in Home 10 Away 9
Stat Goalkeeper Saves Home 0 Away 1
Stat Fouls Home 5 Away 2
Stat Total Passes Home 188 Away 331
Stat Tackles Home 8 Away 10
Stat Attacks Home 50 Away 61
Stat Dangerous Attacks Home 35 Away 29
2nd Half
Stat Ball Possession Home 45% Away 55%
Stat Goal Attempts Home 5 Away 7
Stat Shots on Goal Home 3 Away 2
Stat Shots off Goal Home 1 Away 2
Stat Blocked Shots Home 1 Away 3
Stat Free Kicks Home 6 Away 6
Stat Corner Kicks Home 4 Away 5
Stat Offsides Home 1 Away 1
Stat Throw-in Home 5 Away 6
Stat Goalkeeper Saves Home 0 Away 3
Stat Fouls Home 5 Away 5
Stat Total Passes Home 201 Away 253
Stat Tackles Home 6 Away 5
Stat Attacks Home 57 Away 44
Stat Dangerous Attacks Home 42 Away 24
Answer 0 (score: 0)
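When I look at the Network tab in DevTools (with the XHR filter on and addresses filtered by the text _1_), I see the other values at different URLs:
Match Summary
Statistics
Formation & Starting Lineups
Comments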
They all return odd-looking strings, but I can see a pattern in them:
~ separates lines (records)
¬ splits a line into items
÷ splits an item into name and value
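The three separators above map directly onto nested splits. Here is a minimal sketch of that idea; the name parse_feed is made up for illustration and is not part of any library. It keeps each record as a list of (name, value) pairs rather than a dict, because a name can repeat inside one record (in the Match Summary output further down, IE and IF each appear twice in the same record).

```python
def parse_feed(text):
    """Split a feed string into records; each record is a list of (name, value) pairs."""
    records = []
    for line in text.strip().split('~'):
        pairs = []
        for item in line.split('¬'):
            if not item:
                continue  # empty piece left behind by a trailing '¬'
            name, _, value = item.partition('÷')
            pairs.append((name, value))
        if pairs:
            records.append(pairs)
    return records


# shortened sample taken from the statistics output in the question
sample = 'SE÷Match¬~SG÷Ball Possession¬SH÷41%¬SI÷59%¬~SG÷Goal Attempts¬SH÷10¬SI÷20¬~A1÷¬~'
for record in parse_feed(sample):
    print(record)
```

Every feed can go through the same function; what differs per feed is how the resulting records are interpreted.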
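If I use this to reformat the data it becomes more readable, but it still has to be organized into lists and dictionaries, and each URL needs its own code for that. So I skip this part.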
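For reference, a rough sketch of what that organizing step could look like for the Statistics feed alone. It assumes, based on the output format requested in the question, that SE names a section (Match, 1st Half, 2nd Half), SG names a statistic, and SH and SI hold the home and away values. The function name organize_statistics is hypothetical.

```python
def organize_statistics(text):
    """Group a Statistics feed as {section: {stat name: (home, away)}}."""
    stats = {}
    section = None
    for line in text.strip().split('~'):
        # names do not repeat within a statistics record, so a dict is safe here
        fields = dict(
            item.split('÷', 1) for item in line.split('¬') if '÷' in item
        )
        if 'SE' in fields:
            section = fields['SE']
            stats[section] = {}
        elif 'SG' in fields and section is not None:
            stats[section][fields['SG']] = (fields.get('SH'), fields.get('SI'))
    return stats


# shortened sample taken from the statistics output in the question
sample = ('SE÷Match¬~SG÷Ball Possession¬SH÷41%¬SI÷59%¬'
          '~SE÷1st Half¬~SG÷Ball Possession¬SH÷37%¬SI÷63%¬~A1÷¬~')
stats = organize_statistics(sample)
print(stats['Match']['Ball Possession'])
```

The other feeds would need their own variant of this function, since their names and record layouts differ.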
Minimal working code that fetches and prints these feeds, Statistics included:
import requests
from bs4 import BeautifulSoup
import json
import re

def display(text):
    # print every item as name|value; the empty item after a trailing '¬' prints as a bare '>'
    text = text.strip()
    for line in text.split('~'):
        items = line.split('¬')
        for item in items:
            parts = item.split('÷')
            print('>', '|'.join(parts))

url = 'https://www.flashscore.com/match/tE4RoHzB/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}

# the session sends the same headers with every request
s = requests.Session()
s.headers.update(headers)

response = s.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# the feed signature sits in the inline script that sets window.environment
scripts = soup.find_all('script', {'type': "text/javascript"})
for script in scripts:
    if 'window.environment =' in str(script):
        scriptStr = str(script)
        jsonMatch = re.compile("{.*}")
        jsonStr = jsonMatch.search(scriptStr)[0]
        jsonData = json.loads(jsonStr)
        fsign = jsonData['config']['app']['feed_sign']
        s.headers.update({'x-fsign': fsign})

print('--- Match Summary ---')
url = 'https://d.flashscore.com/x/feed/dc_1_tE4RoHzB'
response = s.get(url)
display(response.text)

url = 'https://d.flashscore.com/x/feed/df_sui_1_tE4RoHzB'
response = s.get(url)
display(response.text)

url = 'https://d.flashscore.com/x/feed/df_dos_1_tE4RoHzB_'
response = s.get(url)
display(response.text)

print('--- Statistics ---')
url = 'https://d.flashscore.com/x/feed/df_st_1_tE4RoHzB'
response = s.get(url)
display(response.text)

print('--- Formation & Starting Lineups ---')
url = 'https://d.flashscore.com/x/feed/df_scr_1_tE4RoHzB'  # OK
response = s.get(url)
display(response.text)

url = 'https://d.flashscore.com/x/feed/df_li_1_tE4RoHzB'  # OK
response = s.get(url)
display(response.text)

print('--- Comments ---')
url = "https://d.flashscore.com/x/feed/df_lc_1_tE4RoHzB"  # OK
response = s.get(url)  # x-fsign is already set on the session
display(response.text)
Result (partial):
--- Match Summary ---
> AC|1st Half
> IG|0
> IH|1
>
> III|nTADt7er
> IA|2
> IB|43'
> IE|3
> IF|Firmino R.
> IU|/player/firmino-roberto/CCNtplZe/
> ICT|Goal! Andrew Robertson tees up Roberto Firmino<br />(Liverpool) inside the box, and he keeps<br />his cool to find the bottom right corner.<br />0:1.
> IK|Goal
> IM|CCNtplZe
> IN|452994
> IO|
> IE|8
> IF|Robertson A.
> IU|/player/robertson-andrew/6e7Be9VI/
> ICT|
> IK|Assistance
> IM|6e7Be9VI
--- Statistics ---
> SE|Match
>
> SG|Ball Possession
> SH|41%
> SI|59%
>
> SG|Goal Attempts
> SH|10
> SI|20
>
> SG|Shots on Goal
> SH|4
> SI|3
>
> SG|Shots off Goal
> SH|3
> SI|9
>
> SG|Blocked Shots
> SH|3
> SI|8
>
--- Formation & Starting Lineups ---
> SPT|1
> SPI|bDmUiRg3
> SPF|199
> SPG|Scotland
> SPR|/player/bardsley-phillip/bDmUiRg3/
> SPN|Bardsley P.
> SPC|1
> SPU|0
> SPE|Hernia
> SPD|There is some chance of playing.
>
> SPT|1
> SPI|GjF2pGwT
> SPF|96
> SPG|Ireland
> SPR|/player/brady-robbie/GjF2pGwT/
> SPN|Brady R.
> SPC|1
> SPU|0
> SPE|Calf Injury
> SPD|There is some chance of playing.
>
--- Comments ---
> MA|
>
> MB|90+4'
> MK|90:00 +3:29
> MC|whistle
> MD|There will be no more action in this match as the referee signals full time.
> MF|1
> MG|740
> MH|https://media-content-enetpulse.secure.footprint.net/gallery/2021/5/19/7df4d412740f558c20283d63a180c964o2.jpg
>
> MB|90+4'
> MK|90:00 +3:20
> MC|corner
> MD|Liverpool failed to take advantage of the corner as the opposition's defence was alert and averted the threat. Liverpool are still threatening though, as it's a corner.
>
> MB|90+3'
> MK|90:00 +2:50
> MC|
> MD|Mohamed Salah (Liverpool) skips past his man but can't keep the ball in play. Liverpool earn a corner.
> ME|1
> MF|1
>
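The question mentions thousands of matches, and most feed URLs in the code above have the shape `<prefix>_1_<match id>`. Assuming that shape holds for other matches (an assumption, not verified here), the URLs can be built from the match page address instead of being written out by hand. The names SECTION_PREFIXES, feed_url, match_id_from_url and section_urls are hypothetical helpers; the prefixes are the ones used above, except df_dos, which is left out because its URL ends with an extra underscore.

```python
from urllib.parse import urlsplit

FEED_BASE = 'https://d.flashscore.com/x/feed'

# prefixes copied from the feed URLs used in the code above
SECTION_PREFIXES = {
    'Match Summary': ['dc', 'df_sui'],
    'Statistics': ['df_st'],
    'Formation & Starting Lineups': ['df_scr', 'df_li'],
    'Comments': ['df_lc'],
}


def match_id_from_url(url):
    """Return the ID segment of a match page URL such as .../match/tE4RoHzB/#..."""
    parts = urlsplit(url).path.strip('/').split('/')
    if len(parts) < 2 or parts[0] != 'match':
        raise ValueError(f'not a match URL: {url}')
    return parts[1]


def feed_url(prefix, match_id):
    # ASSUMPTION: the '<prefix>_1_<match id>' shape holds for every match
    return f'{FEED_BASE}/{prefix}_1_{match_id}'


def section_urls(match_id):
    """Map each section name to the list of feed URLs for one match."""
    return {
        section: [feed_url(prefix, match_id) for prefix in prefixes]
        for section, prefixes in SECTION_PREFIXES.items()
    }


page = 'https://www.flashscore.com/match/tE4RoHzB/#match-summary/match-summary'
match_id = match_id_from_url(page)
for section, urls in section_urls(match_id).items():
    print(section, urls)
```

Each generated URL can then be requested with the same session, so the x-fsign header only has to be looked up once.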
EDIT:
A version that converts the Statistics feed into a pandas DataFrame:
import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd

def display(text):
    # print every item as name|value; the empty item after a trailing '¬' prints as a bare '>'
    text = text.strip()
    for line in text.split('~'):
        items = line.split('¬')
        for item in items:
            parts = item.split('÷')
            print('>', '|'.join(parts))

def format_statistics(text):
    text = text.strip()

    data = []
    row = []
    match_part = ''  # to remember if it is the full match or the 1st/2nd half

    for line in text.split('~'):
        items = line.split('¬')
        for item in items:
            parts = item.split('÷')

            # remember the current part of the match
            if parts[0] == 'SE':
                match_part = parts[1]

            # create row with data
            if parts[0] in ('SG', 'SH', 'SI'):
                row.append(parts[1])

            # add row to data with `match_part`
            if len(row) == 3:
                data.append([match_part] + row)
                # empty row for new data
                row = []

    # convert all to DataFrame
    df = pd.DataFrame(data, columns=['Part', 'Stat', 'SH', 'SI'])
    print(df)
    return df

# -------------------------

url = 'https://www.flashscore.com/match/tE4RoHzB/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}

s = requests.Session()
s.headers.update(headers)

response = s.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# the feed signature sits in the inline script that sets window.environment
scripts = soup.find_all('script', {'type': "text/javascript"})
for script in scripts:
    if 'window.environment =' in str(script):
        scriptStr = str(script)
        jsonMatch = re.compile("{.*}")
        jsonStr = jsonMatch.search(scriptStr)[0]
        jsonData = json.loads(jsonStr)
        fsign = jsonData['config']['app']['feed_sign']
        s.headers.update({'x-fsign': fsign})

print('--- Statistics ---')
url = 'https://d.flashscore.com/x/feed/df_st_1_tE4RoHzB'
response = s.get(url)
display(response.text)

text_statistics = response.text
format_statistics(text_statistics)
Result:
        Part               Stat   SH   SI
0      Match    Ball Possession  41%  59%
1      Match      Goal Attempts   10   20
2      Match      Shots on Goal    4    3
3      Match     Shots off Goal    3    9
4      Match      Blocked Shots    3    8
5      Match         Free Kicks    8   11
6      Match       Corner Kicks    6    7
7      Match           Offsides    1    1
8      Match           Throw-in   15   15
9      Match   Goalkeeper Saves    0    4
10     Match              Fouls   10    7
11     Match       Total Passes  389  584
12     Match            Tackles   14   15
13     Match            Attacks  107  105
14     Match  Dangerous Attacks   77   53
15  1st Half    Ball Possession  37%  63%
16  1st Half      Goal Attempts    5   13
17  1st Half      Shots on Goal    1    1
18  1st Half     Shots off Goal    2    7
19  1st Half      Blocked Shots    2    5
20  1st Half         Free Kicks    2    5
21  1st Half       Corner Kicks    2    2
22  1st Half           Offsides    0    0
23  1st Half           Throw-in   10    9
24  1st Half   Goalkeeper Saves    0    1
25  1st Half              Fouls    5    2
26  1st Half       Total Passes  188  331
27  1st Half            Tackles    8   10
28  1st Half            Attacks   50   61
29  1st Half  Dangerous Attacks   35   29
30  2nd Half    Ball Possession  45%  55%
31  2nd Half      Goal Attempts    5    7
32  2nd Half      Shots on Goal    3    2
33  2nd Half     Shots off Goal    1    2
34  2nd Half      Blocked Shots    1    3
35  2nd Half         Free Kicks    6    6
36  2nd Half       Corner Kicks    4    5
37  2nd Half           Offsides    1    1
38  2nd Half           Throw-in    5    6
39  2nd Half   Goalkeeper Saves    0    3
40  2nd Half              Fouls    5    5
41  2nd Half       Total Passes  201  253
42  2nd Half            Tackles    6    5
43  2nd Half            Attacks   57   44
44  2nd Half  Dangerous Attacks   42   24
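The question asks for plain lines such as "Stat Ball Possession Home 41% Away 59%" rather than a DataFrame. A small sketch that produces that layout straight from the feed text, again treating SH as the home value and SI as the away value, as the question's own example does. The function name to_report_lines is made up for illustration.

```python
def to_report_lines(text):
    """Render a Statistics feed as section titles followed by 'Stat ... Home ... Away ...' lines."""
    lines = []
    for record in text.strip().split('~'):
        fields = dict(
            item.split('÷', 1) for item in record.split('¬') if '÷' in item
        )
        if 'SE' in fields:
            # section title: Match, 1st Half, 2nd Half
            lines.append(fields['SE'])
        elif 'SG' in fields:
            lines.append(
                f"Stat {fields['SG']} Home {fields.get('SH', '')} Away {fields.get('SI', '')}"
            )
    return lines


# shortened sample taken from the statistics output in the question
sample = ('SE÷Match¬~SG÷Ball Possession¬SH÷41%¬SI÷59%¬~SG÷Goal Attempts¬SH÷10¬SI÷20¬'
          '~SE÷1st Half¬~SG÷Ball Possession¬SH÷37%¬SI÷63%¬~A1÷¬~')
print('\n'.join(to_report_lines(sample)))
```

With the real feed, response.text from the df_st request can be passed in place of the sample string.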