Web scraping and finding elements

Date: 2021-07-16 18:54:18

Tags: python web-scraping beautifulsoup parent-child

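I am trying to detect when a game is postponed and get the related team information or game number, because I am appending team abbreviations to a list. What currently happens is that my loop only picks up the postponed items and skips the games that are not postponed, so the counter no longer matches the game's position on the page. I think I need to change the soup.select line, or do something slightly different, but I can't figure it out.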

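The code doesn't throw any errors, but the returned list is [0,1,2,3]. However, if you open https://www.rotowire.com/baseball/daily-lineups.php, it should return [0,1,14,15], because those are the team elements whose games are postponed.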

from bs4 import BeautifulSoup
import requests

url = 'https://www.rotowire.com/baseball/daily-lineups.php'

r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

x = 0

gamesRemoved = []

for tag in soup.select(".lineup__main > div"):
    ppcheck = tag.text
    if "POSTPONED" in ppcheck:
        print(x)
        print('Postponement')
        first_team = x*2
        print(first_team)
        gamesRemoved.append(first_team)
        second_team = x*2+1
        gamesRemoved.append(second_team)
        x+=1
        
    else:
        x+=1
        continue
print(gamesRemoved)   

1 answer:

Answer 0 (score: 2)

You can use BeautifulSoup.select and check whether 'is-postponed' is present as a class name on each lineup box:

from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://www.rotowire.com/baseball/daily-lineups.php').text, 'html.parser')
# Each '.lineup.is-mlb' box is one game; game i maps to team indices i*2 and i*2+1
p = [j for i, a in enumerate(d.select('.lineup.is-mlb')) for j in [i*2, i*2+1] if 'is-postponed' in a['class']]
print(p)

Output:

[0, 1, 14, 15]
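
The index arithmetic in the comprehension can also be written as an explicit loop and checked without a network request. The following is only a minimal sketch: the function name `postponed_team_indices`, its `marker` parameter, and the `sample` data are hypothetical and not part of the original answer. It assumes each game is represented by its list of class names, which is what `a['class']` yields in the answer's code. The key point is that `enumerate` counts every game, postponed or not, so the position of a postponed game is preserved.

```python
from typing import Iterable, List, Sequence


def postponed_team_indices(game_classes: Iterable[Sequence[str]],
                           marker: str = "is-postponed") -> List[int]:
    """Return team indices (2*i and 2*i+1) for every game i whose classes include marker."""
    removed: List[int] = []
    # enumerate() advances for every game, so i is the game's position on the page
    for i, classes in enumerate(game_classes):
        if marker in classes:
            removed.extend([2 * i, 2 * i + 1])
    return removed


# Hypothetical class lists for eight games; the first and last are postponed
sample = (
    [["lineup", "is-mlb", "is-postponed"]]
    + [["lineup", "is-mlb"]] * 6
    + [["lineup", "is-mlb", "is-postponed"]]
)
print(postponed_team_indices(sample))  # [0, 1, 14, 15]
```

With the page parsed as `d` as in the answer, the input could be built with `(a['class'] for a in d.select('.lineup.is-mlb'))`, assuming the page still uses those class names.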