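I want to find the text inside each button of a table. I want to parse each sub-table separately, so I need to know which button corresponds to which table in order to tell them apart.
Right now, I have this: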
default_url = 'https://fbref.com'
url = default_url + row['squad_href']
res = requests.get(url)
## The next two lines get around the issue with comments breaking the parsing.
comm = re.compile("<!--|-->")
soup = BeautifulSoup(comm.sub("",res.text),'lxml')
info = soup.findAll("div", {"class": "sub_section_heading"}) #my button class
This returns:
[<div class="sub_section_heading"> <button class="sr_preset tooltip visible active" data-hide="[id^=all_stats_standard_ks]" data-show="#all_stats_standard_ks_3232" id="button_stats_standard_ks_3232" onclick="setTimeout(function(){sr_st_construct_stats_table_features('stats_standard_ks_3232'); }, 100);" type="button">Premier League</button>
<button class="sr_preset tooltip visible" data-hide="[id^=all_stats_standard_ks]" data-show="#all_stats_standard_ks_2901" id="button_stats_standard_ks_2901" onclick="setTimeout(function(){sr_st_construct_stats_table_features('stats_standard_ks_2901'); }, 100);" type="button">Europa League</button>
<button class="sr_preset tooltip visible" data-hide="[id^=all_stats_standard_ks]" data-show="#all_stats_standard_ks_8833" id="button_stats_standard_ks_8833" onclick="setTimeout(function(){sr_st_construct_stats_table_features('stats_standard_ks_8833'); }, 100);" type="button">EFL Cup</button>
<button class="sr_preset tooltip visible" data-hide="[id^=all_stats_standard_ks]" data-show="#all_stats_standard_ks_5591" id="button_stats_standard_ks_5591" onclick="setTimeout(function(){sr_st_construct_stats_table_features('stats_standard_ks_5591'); }, 100);" type="button">FA Cup</button>
<button class="sr_preset tooltip visible" data-hide="[id^=all_stats_standard_ks]" data-show="#all_stats_standard_ks_combined" id="button_stats_standard_ks_combined" onclick="setTimeout(function(){sr_st_construct_stats_table_features('stats_standard_ks_combined'); }, 100);" type="button">All Competitions</button>
</div>
I want something that returns an array with the name of each button, which in this case would be: [ 'Premier League', 'Europa League', 'EFL Cup', 'FA Cup', 'All Competitions']
Any suggestions are appreciated.
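Answer 0 (score: 0)
You are selecting the divs, when what you are actually interested in is the values contained in the buttons inside those divs. The first job is to get the buttons. To do that, we change the selector: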
info = soup.select(".sub_section_heading button")
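This gets every button nested inside an element with the class sub_section_heading, which here means the divs.
From there you want to build a list containing only the text of each button, and a list comprehension does that: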
button_texts = [x.text for x in info]
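button_texts will be a list containing only the button labels, but if the page has more than one such div, the labels may be repeated. To keep only the distinct values: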
distinct_texts = list(set(button_texts))
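One caveat: a set does not keep the original order, so the labels may come back shuffled. If the button order matters, a common alternative is dict.fromkeys. The sketch below uses a made-up sample list with repeats rather than the real page, just to show the behaviour:

```python
# Sample labels with repeats, as could happen when several divs list the same competitions.
button_texts = [
    'Premier League', 'Europa League', 'EFL Cup',
    'Premier League', 'FA Cup', 'EFL Cup', 'All Competitions',
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while preserving first-seen order.
distinct_texts = list(dict.fromkeys(button_texts))
print(distinct_texts)
```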
The full code is below.
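import re
import requests
from bs4 import BeautifulSoup

default_url = 'https://fbref.com'
url = default_url + row['squad_href']  # row comes from the asker's own loop
res = requests.get(url)
## The next two lines get around the issue with comments breaking the parsing.
comm = re.compile("<!--|-->")
soup = BeautifulSoup(comm.sub("", res.text), 'lxml')
# Select every button nested inside an element with class sub_section_heading.
info = soup.select(".sub_section_heading button")
button_texts = [x.text for x in info]
# Deduplicate the labels in case several divs repeat them.
distinct_texts = list(set(button_texts))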
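Since the goal in the question is to tell the sub-tables apart, one possible extension is to pair each label with the element id its button points to. This is only a sketch: it assumes the data-show attribute always holds an id selector such as "#all_stats_standard_ks_3232" (inferred from the output in the question, not checked against the site), and it runs on a shortened copy of that markup with Python's built-in html.parser instead of fetching the page. The name label_to_id is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup shown in the question (most attributes removed).
html = """
<div class="sub_section_heading">
<button data-show="#all_stats_standard_ks_3232" type="button">Premier League</button>
<button data-show="#all_stats_standard_ks_2901" type="button">Europa League</button>
<button data-show="#all_stats_standard_ks_combined" type="button">All Competitions</button>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each button label to the id named in its data-show attribute.
# Assumption: data-show is a "#id" selector, as in the question's output.
label_to_id = {
    button.get_text(strip=True): button["data-show"].lstrip("#")
    for button in soup.select(".sub_section_heading button")
}
print(label_to_id)
```

If the page is structured as assumed, something like soup.select_one("#" + label_to_id["Premier League"]) could then be used to look up the container for that competition.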