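Scraping a dynamic page using requests

Date: 2021-05-21 15:12:51

Tags: python web-scraping python-requests

I started scraping this website, https://www.flashscore.com/, with Selenium, but the process is very slow because I have to scrape thousands of URLs, so I looked for a faster approach using requests.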

import requests
from bs4 import BeautifulSoup
import json
import re


url = 'https://www.flashscore.com/match/tE4RoHzB/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script',{'type':"text/javascript"})
for script in scripts:
    if 'window.environment =' in str(script):
        scriptStr = str(script)
        jsonMatch = re.compile("{.*}")
        jsonStr = jsonMatch.search(scriptStr)[0]
        jsonData = json.loads(jsonStr)

fsign = jsonData['config']['app']['feed_sign']
headers.update({'x-fsign':fsign})
url = "https://d.flashscore.com/x/feed/df_st_1_tE4RoHzB"

response = requests.get(url, headers=headers)

print(response.status_code)
print(response.text.strip())

Output

SE÷Match¬~SG÷Ball Possession¬SH÷41%¬SI÷59%¬~SG÷Goal Attempts¬SH÷10¬SI÷20¬~SG÷Shots on Goal¬SH÷4¬SI÷3¬~SG÷Shots off Goal¬SH÷3¬SI÷9¬~SG÷Blocked Shots¬SH÷3¬SI÷8¬~SG÷Free Kicks¬SH÷8¬SI÷11¬~SG÷Corner Kicks¬SH÷6¬SI÷7¬~SG÷Offsides¬SH÷1¬SI÷1¬~SG÷Throw-in¬SH÷15¬SI÷15¬~SG÷Goalkeeper Saves¬SH÷0¬SI÷4¬~SG÷Fouls¬SH÷10¬SI÷7¬~SG÷Total Passes¬SH÷389¬SI÷584¬~SG÷Tackles¬SH÷14¬SI÷15¬~SG÷Attacks¬SH÷107¬SI÷105¬~SG÷Dangerous Attacks¬SH÷77¬SI÷53¬~SE÷1st Half¬~SG÷Ball Possession¬SH÷37%¬SI÷63%¬~SG÷Goal Attempts¬SH÷5¬SI÷13¬~SG÷Shots on Goal¬SH÷1¬SI÷1¬~SG÷Shots off Goal¬SH÷2¬SI÷7¬~SG÷Blocked Shots¬SH÷2¬SI÷5¬~SG÷Free Kicks¬SH÷2¬SI÷5¬~SG÷Corner Kicks¬SH÷2¬SI÷2¬~SG÷Offsides¬SH÷0¬SI÷0¬~SG÷Throw-in¬SH÷10¬SI÷9¬~SG÷Goalkeeper Saves¬SH÷0¬SI÷1¬~SG÷Fouls¬SH÷5¬SI÷2¬~SG÷Total Passes¬SH÷188¬SI÷331¬~SG÷Tackles¬SH÷8¬SI÷10¬~SG÷Attacks¬SH÷50¬SI÷61¬~SG÷Dangerous Attacks¬SH÷35¬SI÷29¬~SE÷2nd Half¬~SG÷Ball Possession¬SH÷45%¬SI÷55%¬~SG÷Goal Attempts¬SH÷5¬SI÷7¬~SG÷Shots on Goal¬SH÷3¬SI÷2¬~SG÷Shots off Goal¬SH÷1¬SI÷2¬~SG÷Blocked Shots¬SH÷1¬SI÷3¬~SG÷Free Kicks¬SH÷6¬SI÷6¬~SG÷Corner Kicks¬SH÷4¬SI÷5¬~SG÷Offsides¬SH÷1¬SI÷1¬~SG÷Throw-in¬SH÷5¬SI÷6¬~SG÷Goalkeeper Saves¬SH÷0¬SI÷3¬~SG÷Fouls¬SH÷5¬SI÷5¬~SG÷Total Passes¬SH÷201¬SI÷253¬~SG÷Tackles¬SH÷6¬SI÷5¬~SG÷Attacks¬SH÷57¬SI÷44¬~SG÷Dangerous Attacks¬SH÷42¬SI÷24¬~A1÷¬~

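With this code I can reach the URL that holds the statistics in this specific format. But how do I scrape the data out of that response, and how do I get the other information, such as the teams, the score, and the date and time?

The URLs to be scraped look like this: https://www.flashscore.com/match/tE4RoHzB/#match-summary/match-summary

After making a few substitutions on the pattern: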

Match
Stat Ball Possession Home 41% Away 59%
Stat Goal Attempts Home 10 Away 20
Stat Shots on Goal Home 4 Away 3
Stat Shots off Goal Home 3 Away 9
Stat Blocked Shots Home 3 Away 8
Stat Free Kicks Home 8 Away 11
Stat Corner Kicks Home 6 Away 7
Stat Offsides Home 1 Away 1
Stat Throw-in Home 15 Away 15
Stat Goalkeeper Saves Home 0 Away 4
Stat Fouls Home 10 Away 7
Stat Total Passes Home 389 Away 584
Stat Tackles Home 14 Away 15
Stat Attacks Home 107 Away 105
Stat Dangerous Attacks Home 77 Away 53
1st Half
Stat Ball Possession Home 37% Away 63%
Stat Goal Attempts Home 5 Away 13
Stat Shots on Goal Home 1 Away 1
Stat Shots off Goal Home 2 Away 7
Stat Blocked Shots Home 2 Away 5
Stat Free Kicks Home 2 Away 5
Stat Corner Kicks Home 2 Away 2
Stat Offsides Home 0 Away 0
Stat Throw-in Home 10 Away 9
Stat Goalkeeper Saves Home 0 Away 1
Stat Fouls Home 5 Away 2
Stat Total Passes Home 188 Away 331
Stat Tackles Home 8 Away 10
Stat Attacks Home 50 Away 61
Stat Dangerous Attacks Home 35 Away 29
2nd Half
Stat Ball Possession Home 45% Away 55%
Stat Goal Attempts Home 5 Away 7
Stat Shots on Goal Home 3 Away 2
Stat Shots off Goal Home 1 Away 2
Stat Blocked Shots Home 1 Away 3
Stat Free Kicks Home 6 Away 6
Stat Corner Kicks Home 4 Away 5
Stat Offsides Home 1 Away 1
Stat Throw-in Home 5 Away 6
Stat Goalkeeper Saves Home 0 Away 3
Stat Fouls Home 5 Away 5
Stat Total Passes Home 201 Away 253
Stat Tackles Home 6 Away 5
Stat Attacks Home 57 Away 44
Stat Dangerous Attacks Home 42 Away 24

1 answer:

Answer 0 (score: 0)

When I watch the Network tab in DevTools (with the XHR filter on and the addresses filtered by the text _1_), I see other values under different URLs:

Match Summary

Statistics

Formation & Starting Lineups

Comments
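The feed URLs requested in the code further down all follow one pattern: a short prefix, then the match ID taken from the page URL. A minimal sketch of building them for any match, assuming other matches use the same prefixes as tE4RoHzB (the answer only shows this one match). The names FEED_TEMPLATES and feed_urls are made up for this illustration, and the grouping by tab follows the print headers in the code below.

```python
# Base address and templates copied from the URLs used in this answer.
BASE = "https://d.flashscore.com/x/feed/"

# Assumption: the same prefixes work for every match ID.
FEED_TEMPLATES = {
    "Match Summary": ["dc_1_{match_id}", "df_sui_1_{match_id}", "df_dos_1_{match_id}_"],
    "Statistics": ["df_st_1_{match_id}"],
    "Formation & Starting Lineups": ["df_scr_1_{match_id}", "df_li_1_{match_id}"],
    "Comments": ["df_lc_1_{match_id}"],
}


def feed_urls(match_id):
    """Return a dict mapping each tab name to the list of feed URLs for one match."""
    return {
        tab: [BASE + template.format(match_id=match_id) for template in templates]
        for tab, templates in FEED_TEMPLATES.items()
    }


if __name__ == "__main__":
    for tab, urls in feed_urls("tE4RoHzB").items():
        print(tab)
        for url in urls:
            print("   ", url)
```

With thousands of matches, a loop over the match IDs can call feed_urls once per ID and reuse the same requests.Session for every request.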

They all return strange strings, but I see some patterns in them:

~ marks a new line
¬ splits a line into items
÷ splits an item into name and value
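The three rules above can be sketched as a small parser. This is only an illustration of the observed pattern, not a documented format; parse_feed is a hypothetical name. Each line becomes a list of (name, value) pairs rather than a dict, because in the Match Summary output further down some names (IE, IF, ...) repeat within one line and a dict would keep only the last one.

```python
def parse_feed(text):
    """Split a feed string into records; each record is a list of (name, value) pairs."""
    records = []
    for line in text.strip().split("~"):       # ~ separates lines
        pairs = []
        for item in line.split("¬"):           # ¬ separates items in a line
            if item:                           # skip the empty piece after a trailing ¬
                name, _, value = item.partition("÷")   # ÷ separates name and value
                pairs.append((name, value))
        if pairs:                              # skip empty lines
            records.append(pairs)
    return records


if __name__ == "__main__":
    # Shortened sample taken from the statistics output in the question.
    sample = "SE÷Match¬~SG÷Ball Possession¬SH÷41%¬SI÷59%¬~SG÷Goal Attempts¬SH÷10¬SI÷20¬~A1÷¬~"
    for record in parse_feed(sample):
        print(record)
```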

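If I use these rules to reformat the data, it becomes more readable, but it still has to be organized into lists and dictionaries. Each URL needs its own code for that, so I skip that part.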
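As one example of such organizing, here is a rough sketch for the statistics feed only. It assumes SE starts a section (Match, 1st Half, 2nd Half), SG is the statistic name, and SH and SI are the home and away values; that reading comes from the rewritten output in the question and is not confirmed anywhere else. statistics_to_dict is a hypothetical helper name.

```python
def statistics_to_dict(text):
    """Group a statistics feed as {section: {stat name: (home, away)}}."""
    stats = {}
    section = None
    for line in text.strip().split("~"):
        # Turn one line into a name -> value dict (names do not repeat in this feed).
        fields = dict(item.split("÷", 1) for item in line.split("¬") if "÷" in item)
        if "SE" in fields:
            # A new section starts: Match, 1st Half, 2nd Half.
            section = fields["SE"]
            stats[section] = {}
        elif "SG" in fields and section is not None:
            # Assumption: SH is the home value and SI the away value.
            stats[section][fields["SG"]] = (fields.get("SH"), fields.get("SI"))
    return stats


if __name__ == "__main__":
    # Shortened sample taken from the statistics output in the question.
    sample = ("SE÷Match¬~SG÷Ball Possession¬SH÷41%¬SI÷59%¬"
              "~SE÷1st Half¬~SG÷Ball Possession¬SH÷37%¬SI÷63%¬~A1÷¬~")
    print(statistics_to_dict(sample))
```

The other feeds use different names (IA, IE, SPT, MB, ...), so each of them would need its own grouping rules.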

For Statistics, see the DataFrame version in the edit at the end.


Minimal working code:

import requests
from bs4 import BeautifulSoup
import json
import re

def display(text):
    text = text.strip()
    for line in text.split('~'):
        items = line.split('¬')
        for item in items:
            parts = item.split('÷')
            print('>', '|'.join(parts))
        
url = 'https://www.flashscore.com/match/tE4RoHzB/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}

s = requests.Session()
s.headers.update(headers)

response = s.get(url)  # the session already sends the user-agent header
soup = BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script', {'type':"text/javascript"})
for script in scripts:
    if 'window.environment =' in str(script):
        scriptStr = str(script)
        jsonMatch = re.compile("{.*}")
        jsonStr = jsonMatch.search(scriptStr)[0]
        jsonData = json.loads(jsonStr)


fsign = jsonData['config']['app']['feed_sign']

s.headers.update({'x-fsign':fsign})
                 
                 
print('--- Match Summary ---')
url = 'https://d.flashscore.com/x/feed/dc_1_tE4RoHzB'
response = s.get(url)
display(response.text)

url = 'https://d.flashscore.com/x/feed/df_sui_1_tE4RoHzB'
response = s.get(url)
display(response.text)

url = 'https://d.flashscore.com/x/feed/df_dos_1_tE4RoHzB_'
response = s.get(url)
display(response.text)

print('--- Statictics ---')
url = 'https://d.flashscore.com/x/feed/df_st_1_tE4RoHzB'
response = s.get(url)
display(response.text)

print('--- Formation & Starting Lineups ---')
url = 'https://d.flashscore.com/x/feed/df_scr_1_tE4RoHzB'  # OK
response = s.get(url)
display(response.text)

url = 'https://d.flashscore.com/x/feed/df_li_1_tE4RoHzB'  # OK
response = s.get(url)
display(response.text)

print('--- Comments ---')
url = "https://d.flashscore.com/x/feed/df_lc_1_tE4RoHzB"  # OK
response = s.get(url)  # x-fsign is already set on the session
display(response.text)

Result (partial):

--- Match Summary ---

> AC|1st Half
> IG|0
> IH|1
> 
> III|nTADt7er
> IA|2
> IB|43'
> IE|3
> IF|Firmino R.
> IU|/player/firmino-roberto/CCNtplZe/
> ICT|Goal! Andrew Robertson tees up Roberto Firmino<br />(Liverpool) inside the box, and he keeps<br />his cool to find the bottom right corner.<br />0:1.
> IK|Goal
> IM|CCNtplZe
> IN|452994
> IO|
> IE|8
> IF|Robertson A.
> IU|/player/robertson-andrew/6e7Be9VI/
> ICT|
> IK|Assistance
> IM|6e7Be9VI

--- Statictics ---
> SE|Match
> 
> SG|Ball Possession
> SH|41%
> SI|59%
> 
> SG|Goal Attempts
> SH|10
> SI|20
> 
> SG|Shots on Goal
> SH|4
> SI|3
> 
> SG|Shots off Goal
> SH|3
> SI|9
> 
> SG|Blocked Shots
> SH|3
> SI|8
> 

--- Formation & Starting Lineups ---
> SPT|1
> SPI|bDmUiRg3
> SPF|199
> SPG|Scotland
> SPR|/player/bardsley-phillip/bDmUiRg3/
> SPN|Bardsley P.
> SPC|1
> SPU|0
> SPE|Hernia
> SPD|There is some chance of playing.
> 
> SPT|1
> SPI|GjF2pGwT
> SPF|96
> SPG|Ireland
> SPR|/player/brady-robbie/GjF2pGwT/
> SPN|Brady R.
> SPC|1
> SPU|0
> SPE|Calf Injury
> SPD|There is some chance of playing.
> 


--- Comments ---
> MA|
> 
> MB|90+4'
> MK|90:00 +3:29
> MC|whistle
> MD|There will be no more action in this match as the referee signals full time.
> MF|1
> MG|740
> MH|https://media-content-enetpulse.secure.footprint.net/gallery/2021/5/19/7df4d412740f558c20283d63a180c964o2.jpg
> 
> MB|90+4'
> MK|90:00 +3:20
> MC|corner
> MD|Liverpool failed to take advantage of the corner as the opposition's defence was alert and averted the threat. Liverpool are still threatening though, as it's a corner.
> 
> MB|90+3'
> MK|90:00 +2:50
> MC|
> MD|Mohamed Salah (Liverpool) skips past his man but can't keep the ball in play. Liverpool earn a corner.
> ME|1
> MF|1
> 
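One fragile spot in the code above: if no script contains `window.environment =`, the loop never assigns jsonData and the next line fails with a NameError. A possible guard, sketched with only re and json; extract_feed_sign is a hypothetical helper, and the HTML in the example is an invented placeholder, not a real page or a real sign value.

```python
import json
import re


def extract_feed_sign(html):
    """Return config.app.feed_sign from the page HTML, or None if it cannot be found."""
    # Same idea as the "{.*}" pattern above, but tied to the window.environment assignment.
    match = re.search(r"window\.environment\s*=\s*(\{.*\})", html)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return data.get("config", {}).get("app", {}).get("feed_sign")


if __name__ == "__main__":
    # Invented placeholder HTML, only to show the call.
    html = ('<script type="text/javascript">'
            'window.environment = {"config": {"app": {"feed_sign": "PLACEHOLDER"}}};'
            '</script>')
    print(extract_feed_sign(html))
    print(extract_feed_sign("<html></html>"))
```

When the function returns None, the scraper can skip or retry that match instead of crashing in the middle of a long run.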

EDIT:

A version that converts the Statistics feed into a pandas DataFrame:

import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd

def display(text):
    text = text.strip()
    for line in text.split('~'):
        items = line.split('¬')
        for item in items:
            parts = item.split('÷')
            print('>', '|'.join(parts))

def format_statisctic(text):
    text = text.strip()

    data = []
    
    row = []

    match_part = ''  # remembers whether rows belong to the full match or the 1st/2nd half

    for line in text.split('~'):
        items = line.split('¬')
        for item in items:
            parts = item.split('÷')

            # remember the current section (Match, 1st Half, 2nd Half)
            if parts[0] == 'SE':
                match_part = parts[1]

            # create row with data
            if parts[0] in ('SG', 'SH', 'SI'):
                row.append(parts[1])

            # add row to data with `match_part`
            if len(row) == 3:
                data.append([match_part] + row)
                # empty row for new data
                row = []

    # convert all rows to a DataFrame
    df = pd.DataFrame(data, columns=['Part', 'Stat', 'SH', 'SI'])

    print(df)

    # also return it so the caller can keep working with the data
    return df

# -------------------------
        
url = 'https://www.flashscore.com/match/tE4RoHzB/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}

s = requests.Session()
s.headers.update(headers)

response = s.get(url)  # the session already sends the user-agent header
soup = BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script', {'type':"text/javascript"})
for script in scripts:
    if 'window.environment =' in str(script):
        scriptStr = str(script)
        jsonMatch = re.compile("{.*}")
        jsonStr = jsonMatch.search(scriptStr)[0]
        jsonData = json.loads(jsonStr)


fsign = jsonData['config']['app']['feed_sign']

s.headers.update({'x-fsign':fsign})

print('--- Statictics ---')
url = 'https://d.flashscore.com/x/feed/df_st_1_tE4RoHzB'
response = s.get(url)
display(response.text)
text_statictics = response.text
format_statisctic(text_statictics)

Result:

        Part               Stat   SH   SI
0      Match    Ball Possession  41%  59%
1      Match      Goal Attempts   10   20
2      Match      Shots on Goal    4    3
3      Match     Shots off Goal    3    9
4      Match      Blocked Shots    3    8
5      Match         Free Kicks    8   11
6      Match       Corner Kicks    6    7
7      Match           Offsides    1    1
8      Match           Throw-in   15   15
9      Match   Goalkeeper Saves    0    4
10     Match              Fouls   10    7
11     Match       Total Passes  389  584
12     Match            Tackles   14   15
13     Match            Attacks  107  105
14     Match  Dangerous Attacks   77   53
15  1st Half    Ball Possession  37%  63%
16  1st Half      Goal Attempts    5   13
17  1st Half      Shots on Goal    1    1
18  1st Half     Shots off Goal    2    7
19  1st Half      Blocked Shots    2    5
20  1st Half         Free Kicks    2    5
21  1st Half       Corner Kicks    2    2
22  1st Half           Offsides    0    0
23  1st Half           Throw-in   10    9
24  1st Half   Goalkeeper Saves    0    1
25  1st Half              Fouls    5    2
26  1st Half       Total Passes  188  331
27  1st Half            Tackles    8   10
28  1st Half            Attacks   50   61
29  1st Half  Dangerous Attacks   35   29
30  2nd Half    Ball Possession  45%  55%
31  2nd Half      Goal Attempts    5    7
32  2nd Half      Shots on Goal    3    2
33  2nd Half     Shots off Goal    1    2
34  2nd Half      Blocked Shots    1    3
35  2nd Half         Free Kicks    6    6
36  2nd Half       Corner Kicks    4    5
37  2nd Half           Offsides    1    1
38  2nd Half           Throw-in    5    6
39  2nd Half   Goalkeeper Saves    0    3
40  2nd Half              Fouls    5    5
41  2nd Half       Total Passes  201  253
42  2nd Half            Tackles    6    5
43  2nd Half            Attacks   57   44
44  2nd Half  Dangerous Attacks   42   24
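All values in this table are still strings ('41%', '389'), so they cannot be summed or compared as numbers yet. A small sketch of a converter, assuming values are either whole numbers or percentages as in the output above; to_number is a hypothetical helper name.

```python
def to_number(value):
    """Convert '41%' to 41.0 and '389' to 389; return anything else unchanged."""
    text = value.strip()
    try:
        if text.endswith("%"):
            return float(text[:-1])   # percentage: drop the sign, keep a float
        return int(text)              # plain count
    except ValueError:
        return value                  # unexpected format: leave it as it is


if __name__ == "__main__":
    for raw in ["41%", "389", "0"]:
        print(raw, "->", to_number(raw))
```

On the DataFrame it could be applied per column, for example df['SH'].map(to_number) and df['SI'].map(to_number).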