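How can I scrape the structure below so that I get only the h3 and h4 elements directly above an h5 whose string is "Prem League", together with the div class="fixres__item" directly below that h5?
I want the text from the h3, the h4 and the div; inside the div, I need the text from a span nested within another span.
So whenever the h5 string is Prem League, I want the h4 and h3 directly above it, and I also need to pull various elements out of the fixres__item directly below that h5.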
<div class="fixres__body" data-url="" data-view="fixture-update" data-controller="fixture-update" data-fn="live-refresh" data-sport="football" data-lite="true" id="widgetLite-6">
<h3 class="fixres__header1">November 2018</h3>
<h4 class="fixres__header2">Saturday 24th November</h4>
<h5 class="fixres__header3">Prem League</h5>
<div class="fixres__item">stuff in here</div>
<h4 class="fixres__header2">Wednesday 28th November</h4>
<h5 class="fixres__header3">UEFA Champ League</h5>
<div class="fixres__item">stuff in here</div>
<h3 class="fixres__header1">December 2018</h3>
<h4 class="fixres__header2">Sunday 2nd December</h4>
<h5 class="fixres__header3">Prem League</h5>
<div class="fixres__item">stuff in here</div>
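This is my code so far, but it also includes data from the div below the h5 string "UEFA Champ League", which I don't want. I only want the data from divs that sit below an h5 titled "Prem League". For example, I don't want PSG in the output, because it comes from the div below the h5 titled "UEFA Champ League".
My code: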
import requests
from bs4 import BeautifulSoup


def squad_fixtures():
    team_table = ['https://someurl.com/liverpool-fixtures']
    team_home_namesall = []
    for i in team_table:
        # team_fixture_urls = [i.replace('-squad', '-fixtures') for i in team_table]
        squad_r = requests.get(i)
        premier_squad_soup = BeautifulSoup(squad_r.text, 'html.parser')
        # print(premier_squad_soup)
        premier_fix_body = premier_squad_soup.find('div', {'class': 'fixres__body'})
        # print(premier_fix_body)
        # This grabs every fixres__item, whatever h5 it sits under.
        premier_fix_divs = premier_fix_body.find_all('div', {'class': 'fixres__item'})
        for i in premier_fix_divs:
            team_home = i.find_all('span', {'class': 'matches__item-col matches__participant matches__participant--side1'})
            for i in team_home:
                team_home_names = i.find('span', {'class': 'swap-text--bp30'})['title']
                team_home_namesall.append(team_home_names)
    print(team_home_namesall)
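Output:
['Watford', 'PSG', 'Liverpool', 'Burnley', "B'mouth", 'Liverpool', 'Liverpool', 'Wolves', 'Liverpool', 'Liverpool', 'Man City', 'Brighton', 'Liverpool', 'Liverpool', 'West Ham', 'Liverpool', 'Man Utd', 'Liverpool', 'Everton', 'Liverpool', 'Fulham', 'Liverpool', "So'ton", 'Liverpool', 'Cardiff', 'Liverpool', 'Newcastle', 'Liverpool']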
Answer 0 (score: 1)
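Your challenge seems to be restricting the scrape to the Premier League <h5> elements and the content associated with them.
Note: your question states that the string in the h5 should be Prem League, but when I looked at the actual response it appears to be Premier League.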
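This HTML is flat: the headers and the fixture divs are all siblings, with no nesting to tell the sections apart. The best approach therefore seems to be to walk the previous and next siblings of each matching h5, which is itself easy to find: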
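import re

import requests
from bs4 import BeautifulSoup, Tag

# The live page labels the competition "Premier League", not "Prem League".
prem_league_regex = re.compile(r"Premier League")


def squad_fixtures():
    team_table = ['https://www.skysports.com/liverpool-fixtures']
    for url in team_table:
        squad_r = requests.get(url)
        soup = BeautifulSoup(squad_r.text, 'html.parser')
        body = soup.find('div', {'class': 'fixres__body'})
        # string= is the current name of the older text= argument.
        h5s = body.find_all('h5', {'class': 'fixres__header3'}, string=prem_league_regex)
        for h5 in h5s:
            # The date header (h4) should sit directly above the h5.
            prev_tag = find_previous(h5)
            if prev_tag is not None and prev_tag.name == 'h4':
                print(prev_tag.text)
                # The month header (h3) is only directly above the first date of a month.
                prev_tag = find_previous(prev_tag)
                if prev_tag is not None and prev_tag.name == 'h3':
                    print(prev_tag.text)
            fixres_item_div = find_next(h5)
            # Get the things you need from fixres__item now that you have it...


def find_previous(tag):
    """Return the nearest previous sibling that is a Tag, or None."""
    prev_tag = tag.previous_sibling
    # Skip whitespace and other text nodes between the tags.
    while prev_tag is not None and not isinstance(prev_tag, Tag):
        prev_tag = prev_tag.previous_sibling
    return prev_tag


def find_next(tag):
    """Return the nearest next sibling that is a Tag, or None."""
    next_tag = tag.next_sibling
    while next_tag is not None and not isinstance(next_tag, Tag):
        next_tag = next_tag.next_sibling
    return next_tag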
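As a possible alternative, the same sibling walk can be sketched with BeautifulSoup's built-in find_previous_sibling and find_next_sibling, which skip text nodes for you. The example below runs offline against sample markup modeled on the question's snippet. The spans inside each fixres__item are an assumption based on the selectors in the question's code, and prem_league_fixtures is a hypothetical helper name, not something from the page or the library.

```python
from bs4 import BeautifulSoup

# Assumed inner markup of a fixture div, based on the question's selectors.
ITEM = (
    '<div class="fixres__item">'
    '<span class="matches__item-col matches__participant matches__participant--side1">'
    '<span class="swap-text--bp30" title="{0}">{0}</span>'
    '</span></div>'
)

# Sample page body modeled on the question's snippet (team names are illustrative).
SAMPLE_HTML = f"""
<div class="fixres__body">
<h3 class="fixres__header1">November 2018</h3>
<h4 class="fixres__header2">Saturday 24th November</h4>
<h5 class="fixres__header3">Prem League</h5>
{ITEM.format('Watford')}
<h4 class="fixres__header2">Wednesday 28th November</h4>
<h5 class="fixres__header3">UEFA Champ League</h5>
{ITEM.format('PSG')}
<h3 class="fixres__header1">December 2018</h3>
<h4 class="fixres__header2">Sunday 2nd December</h4>
<h5 class="fixres__header3">Prem League</h5>
{ITEM.format('Liverpool')}
</div>
"""


def prem_league_fixtures(html, competition="Prem League"):
    """Return (month, date, home team) for each h5 matching `competition`."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_="fixres__body")
    rows = []
    for h5 in body.find_all("h5", class_="fixres__header3"):
        if h5.get_text(strip=True) != competition:
            continue
        # Nearest preceding h4/h3 siblings (nearest, not necessarily adjacent).
        h4 = h5.find_previous_sibling("h4")
        h3 = h5.find_previous_sibling("h3")
        # First tag after the h5; True matches any tag and skips text nodes.
        item = h5.find_next_sibling(True)
        if item is None or "fixres__item" not in item.get("class", []):
            continue
        side1 = item.find("span", class_="matches__participant--side1")
        name = side1.find("span", class_="swap-text--bp30") if side1 is not None else None
        rows.append((
            h3.get_text(strip=True) if h3 is not None else None,
            h4.get_text(strip=True) if h4 is not None else None,
            name["title"] if name is not None else None,
        ))
    return rows


fixtures = prem_league_fixtures(SAMPLE_HTML)
for row in fixtures:
    print(row)
```

Unlike a strict "directly above" check, find_previous_sibling("h3") returns the nearest earlier h3, so a second fixture date in the same month would still pick up its month header. Against the live page you would presumably pass competition="Premier League", since that is the label the response seems to use.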