Scrape basketball results and associate the competition with each match

Time: 2020-07-30 19:57:24

Tags: python web-scraping beautifulsoup python-requests

I want to scrape basketball results from this web page: http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29

I wrote this code using bs4 and requests:

import requests
from bs4 import BeautifulSoup

url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
    r = session.get(url, timeout=30)
    soup = BeautifulSoup(r.content, 'html.parser')

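The problem I'm facing is how to add the competition to each row I scrape. I want to build a table in which each row is a match result (competition, home team, away team, score...).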

2 Answers:

Answer 0: (score: 1)

This page uses JavaScript to load its data, but requests / BeautifulSoup can't run JavaScript.

So you have two options.

First: you can use Selenium to control a real web browser that can run JavaScript. This works better when the page uses complex JavaScript to generate its data, but it is slower, because it has to launch a browser that renders the page and runs the JavaScript.

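Second: you can use DevTools in Chrome / Firefox (tab Network, filter XHR) to find the URL that JavaScript / AJAX (XHR) uses to get data from the server, and then request that URL with requests. Often you get JSON data, which can be converted to a Python list / dictionary, so you don't need BeautifulSoup to scrape it. This is faster, but sometimes the page relies on JavaScript code that is hard to reproduce in Python.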
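
As a minimal sketch of the second option: when an XHR endpoint returns JSON, the response body maps directly onto Python lists and dictionaries. The payload below is made up for illustration (the field names `home`, `away` and `score` are assumptions, not this site's format). With requests you would typically call `r.json()` on the response instead of `json.loads`.

```python
import json

# Hypothetical JSON body, standing in for what an XHR endpoint might return.
sample_body = '[{"home": "Team A", "away": "Team B", "score": [101, 99]}]'


def parse_matches(body):
    """Convert a JSON response body into (home, away, home_pts, away_pts) tuples."""
    matches = []
    for entry in json.loads(body):
        home_pts, away_pts = entry["score"]
        matches.append((entry["home"], entry["away"], home_pts, away_pts))
    return matches


print(parse_matches(sample_body))
```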


I chose the second method.

I found that the page reads its data from

http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000

However, it returns XML data, so it still needs BeautifulSoup (or lxml) to extract the values.

import requests
from bs4 import BeautifulSoup as BS

url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'

r = requests.get(url)

soup = BS(r.text, 'html.parser')

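all_items = soup.find_all('h')  # each <h> element holds one match

for item in all_items:
    # the fields of a match are separated by '^'
    values = item.text.split('^')
    #print(values)
    print(values[8], values[11])   # home team, home score
    print(values[10], values[12])  # away team, away score
    print('---')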

Result:

Portland Trail Blazers 120
Oklahoma City Thunder 131
---
Houston Rockets 137
Boston Celtics 112
---
Philadelphia 76ers 115
Dallas Mavericks 118
---
Connecticut Sun 89
Washington Mystics 94
---
Chicago Sky 96
Los Angeles Sparks 78
---
Seattle Storm 90
Minnesota Lynx 66
---
Labas Pasauli LT 85
Balduasenaras 78
---
BC Vikings 66
Nemuno Banga KK 72
---
NRG Kiev 51
Hizhaki 76
---
Finland 97
Estonia 76
---
Synkarb 82
Sk nemenchine 79
---
CS Sfaxien (w) 51
ES Cap Bon (w) 54
---
Police De La Circulation (w) 43
Etoile Sportive Sahel (w) 39
---
CA Bizertin 63
ES Goulette 71
---
JS Manazeh 77
AS Hammamet 53
---
Southern Huskies 84
Canterbury Rams 98
---
Taranaki Mountainairs 99
Franklin Bulls 90
---
Chaophraya Thunder 67
Thai General Equipment 102
---
Airforce Madgoat Basketball Club 60
HiTech Bangkok City 77
---
Bizoni 82
Leningrad 75
---
chameleon 104
Leningrad 80
---
Bizoni 71
Zubuyu 57
---
Drakony 89
chameleon 79
---
Dragoni 71
Zubuyu 87
---
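
To get the table asked for in the question, each record can be turned into a named row. The sketch below reuses the field positions from the code in this answer (8, 10, 11, 12) and assumes position 1 holds the competition name, as the requests code in the other answer suggests. The sample XML is invented to match that layout. It also swaps in the standard library's `xml.etree.ElementTree` for BeautifulSoup so that it runs without extra packages, which assumes the response is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Invented sample shaped like the endpoint's records: '^'-separated fields in <h> tags.
# Only positions 1, 8, 10, 11 and 12 carry meaning here; 'x' marks placeholders.
SAMPLE_XML = (
    "<c>"
    "<h>0^NBA^x^x^x^x^x^x^Houston Rockets^x^Boston Celtics^137^112</h>"
    "<h>1^WNBA^x^x^x^x^x^x^Chicago Sky^x^Los Angeles Sparks^96^78</h>"
    "</c>"
)


def parse_rows(xml_text):
    """Return one dict per match, with the competition attached to every row."""
    rows = []
    for item in ET.fromstring(xml_text).iter("h"):
        values = (item.text or "").split("^")
        if len(values) < 13:
            continue  # skip records that are too short to index safely
        rows.append({
            "competition": values[1],
            "home": values[8],
            "away": values[10],
            "home score": values[11],
            "away score": values[12],
        })
    return rows


for row in parse_rows(SAMPLE_XML):
    print(row)
```

A list of dictionaries like this can be passed straight to `pandas.DataFrame(rows)` if a table is needed.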

Answer 1: (score: 1)

Try this (Selenium):

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time
res = []
url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get(url)
time.sleep(2)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page, 'html.parser')
span = soup.select_one('span#live')
tables = span.select('table')
for table in tables:
    if table.get('class'):
        competition = table.select_one('a b font').text
    else:
        for home, away in zip(table.select('tr.b1')[0::2], table.select('tr.b1')[1::2]):
            res.append([f"{competition}",
                        f"{home.select_one('td a').text}",
                        f"{away.select_one('td a').text}",
                        f"{home.select_one('td.red').text}",
                        f"{away.select_one('td.red').text}",
                        f"{home.select_one('td.odds1').text}",
                        f"{away.select_one('td.odds1').text}",
                        f"{home.select('td font')[0].text}/{home.select('td font')[1].text}",
                        f"{away.select('td font')[0].text}/{away.select('td font')[1].text}",
                        f"{home.select('td div a')[-1].get('href')}"])
df = pd.DataFrame(res, columns=['competition',
                                'home',
                                'away',
                                'home score',
                                'away score',
                                'home odds',
                                'away odds',
                                'home ht',
                                'away ht',
                                'odds'
                                ])

print(df.to_string())
df.to_csv('Res.csv')

Prints:

                                 competition                              home                       away home score away score home odds away odds home ht away ht                                                  odds
0            National Basketball Association            Portland Trail Blazers      Oklahoma City Thunder        120        131      2.72      1.45   50/70   63/68  http://data.nowgoal.group/OddsCompBasket/387520.html
1            National Basketball Association                   Houston Rockets             Boston Celtics        137        112      1.49      2.58   77/60   60/52  http://data.nowgoal.group/OddsCompBasket/387521.html
2            National Basketball Association                Philadelphia 76ers           Dallas Mavericks        115        118      2.04      1.76   39/64   48/55  http://data.nowgoal.group/OddsCompBasket/387522.html
3    Women’s National Basketball Association                   Connecticut Sun         Washington Mystics         89         94      2.28      1.59   52/37   48/46  http://data.nowgoal.group/OddsCompBasket/385886.html
4    Women’s National Basketball Association                       Chicago Sky         Los Angeles Sparks         96         78      2.72      1.43   40/56   36/42  http://data.nowgoal.group/OddsCompBasket/385618.html
5    Women’s National Basketball Association                     Seattle Storm             Minnesota Lynx         90         66      1.21      4.19   41/49   35/31  http://data.nowgoal.group/OddsCompBasket/385884.html
6                       Friendly Competition                  Labas Pasauli LT              Balduasenaras         85         78                       52/33   31/47  http://data.nowgoal.group/OddsCompBasket/387769.html
7                       Friendly Competition                        BC Vikings            Nemuno Banga KK         66         72                       29/37   30/42  http://data.nowgoal.group/OddsCompBasket/387771.html
8                       Friendly Competition                          NRG Kiev                    Hizhaki         51         76                       31/20   28/48  http://data.nowgoal.group/OddsCompBasket/387766.html
9                       Friendly Competition                           Finland                    Estonia         97         76      2.77      1.40   48/49   29/47  http://data.nowgoal.group/OddsCompBasket/387740.html
10                      Friendly Competition                           Synkarb              Sk nemenchine         82         79                       37/45   38/41  http://data.nowgoal.group/OddsCompBasket/387770.html

And so on...

It also saves the results to Res.csv.
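
The Selenium code in this answer relies on a carry-forward pattern: a header table sets `competition`, and every match table after it reuses that value until the next header appears. A stripped-down sketch of the same idea follows, using an invented list of `(kind, text)` pairs in place of the parsed tables. The `attach_competition` name and the "unknown" fallback are illustrative choices; the fallback shows one way to handle a match that appears before any header.

```python
def attach_competition(blocks):
    """Pair each match with the most recent competition header seen before it."""
    competition = None  # stays None until the first header appears
    rows = []
    for kind, text in blocks:
        if kind == "header":
            competition = text  # remember it for the matches that follow
        else:
            rows.append((competition or "unknown", text))
    return rows


# Invented input: headers and matches in page order.
blocks = [
    ("header", "National Basketball Association"),
    ("match", "Houston Rockets vs Boston Celtics"),
    ("match", "Philadelphia 76ers vs Dallas Mavericks"),
    ("header", "Friendly Competition"),
    ("match", "Finland vs Estonia"),
]

for row in attach_competition(blocks):
    print(row)
```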


Requests

Try this (requests):

import pandas as pd
from bs4 import BeautifulSoup
import requests
res = []

url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
items = soup.find_all('h')

for item in items:
    values = item.text.split('^')
    res.append([f'{values[1]}', f'{values[8]}', f'{values[10]}', f'{values[11]}', f'{values[12]}'])
df = pd.DataFrame(res, columns=['competition', 'home', 'away', 'home score', 'away score'])

print(df.to_string())
df.to_csv('Res.csv')

Prints:

   competition                              home                       away home score away score
0          NBA            Portland Trail Blazers      Oklahoma City Thunder        120        131
1          NBA                   Houston Rockets             Boston Celtics        137        112
2          NBA                Philadelphia 76ers           Dallas Mavericks        115        118
3         WNBA                   Connecticut Sun         Washington Mystics         89         94
4         WNBA                       Chicago Sky         Los Angeles Sparks         96         78
5         WNBA                     Seattle Storm             Minnesota Lynx         90         66
6           FC                  Labas Pasauli LT              Balduasenaras         85         78
7           FC                        BC Vikings            Nemuno Banga KK         66         72
8           FC                          NRG Kiev                    Hizhaki         51         76

It also saves the results to Res.csv.

If you don't want the index column, simply add index=False to df.to_csv('Res.csv'), so that it reads df.to_csv('Res.csv', index=False)

Note on Selenium: you need selenium and geckodriver; in this code, geckodriver is loaded from c:/program/geckodriver.exe

The Selenium version is slower, but there is no need to use DevTools to find and fetch the XML file.
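
Both requests-based snippets hard-code the date in the endpoint URL. A small helper could build the URL for any day. This is a sketch under assumptions: the parameter names are copied from the URL used in the answers, `build_url` is a hypothetical name, and treating `t` as a millisecond Unix timestamp (probably a cache-buster) is a guess from its value, not something the site documents.

```python
import time
from datetime import date
from urllib.parse import urlencode

BASE = "http://www.nowgoal.group/GetNbaWithTimeZone.aspx"


def build_url(day, timezone=2, kind=0, now_ms=None):
    """Build the data URL for a given date; now_ms defaults to the current time."""
    if now_ms is None:
        # Assumption: 't' is the current time in milliseconds since the epoch.
        now_ms = int(time.time() * 1000)
    params = {
        "date": day.isoformat(),  # formatted as YYYY-MM-DD
        "timezone": timezone,
        "kind": kind,
        "t": now_ms,
    }
    return f"{BASE}?{urlencode(params)}"


# Reproduces the URL from the answers when given the same timestamp.
print(build_url(date(2020, 7, 29), now_ms=1596143185000))
```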