Beautiful Soup: finding buttons inside a div

Date: 2020-03-08 18:29:44

Tags: python beautifulsoup

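I want to get the text inside each button of a table. I want to parse each sub-table separately, so I need to know the corresponding buttons in order to tell the tables apart.

Right now, I have this: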

import re
import requests
from bs4 import BeautifulSoup

default_url = 'https://fbref.com'
url = default_url + row['squad_href']  # `row` is defined elsewhere (not shown)
res = requests.get(url)
## The next two lines get around the issue with comments breaking the parsing.
comm = re.compile(r"<!--|-->")
soup = BeautifulSoup(comm.sub("", res.text), 'lxml')
info = soup.find_all("div", {"class": "sub_section_heading"})  # my button class

This returns:

[<div class="sub_section_heading"> <button class="sr_preset tooltip visible active" data-hide="[id^=all_stats_standard_ks]" data-show="#all_stats_standard_ks_3232" id="button_stats_standard_ks_3232" onclick="setTimeout(function(){sr_st_construct_stats_table_features('stats_standard_ks_3232'); }, 100);" type="button">Premier League</button>
<button class="sr_preset tooltip visible" data-hide="[id^=all_stats_standard_ks]" data-show="#all_stats_standard_ks_2901" id="button_stats_standard_ks_2901" onclick="setTimeout(function(){sr_st_construct_stats_table_features('stats_standard_ks_2901'); }, 100);" type="button">Europa League</button>
<button class="sr_preset tooltip visible" data-hide="[id^=all_stats_standard_ks]" data-show="#all_stats_standard_ks_8833" id="button_stats_standard_ks_8833" onclick="setTimeout(function(){sr_st_construct_stats_table_features('stats_standard_ks_8833'); }, 100);" type="button">EFL Cup</button>
<button class="sr_preset tooltip visible" data-hide="[id^=all_stats_standard_ks]" data-show="#all_stats_standard_ks_5591" id="button_stats_standard_ks_5591" onclick="setTimeout(function(){sr_st_construct_stats_table_features('stats_standard_ks_5591'); }, 100);" type="button">FA Cup</button>
<button class="sr_preset tooltip visible" data-hide="[id^=all_stats_standard_ks]" data-show="#all_stats_standard_ks_combined" id="button_stats_standard_ks_combined" onclick="setTimeout(function(){sr_st_construct_stats_table_features('stats_standard_ks_combined'); }, 100);" type="button">All Competitions</button>
</div>

I want something that returns an array with the name of each button. In this case, that would be: ['Premier League', 'Europa League', 'EFL Cup', 'FA Cup', 'All Competitions']

Any suggestions are appreciated.

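1 Answer:

Answer 0: (score: 0)

You are selecting the divs, when what you are actually interested in is the values contained in the buttons inside those divs. The first job is to get the buttons. To do that, change the selector: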

info = soup.select(".sub_section_heading button")

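This gets every button nested inside an element with the class sub_section_heading (here, the divs).

From there, you want to build a list containing only the text of each button. A list comprehension helps here: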

button_texts = [x.text for x in info]
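
As a quick way to check the selector without hitting the site, here is a minimal sketch that runs the same two steps on an inline snippet. The SAMPLE_HTML string is a trimmed-down, hypothetical stand-in for the markup in the question, and "html.parser" is used only so the example needs no extra parser; 'lxml' should behave the same for this input.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, modeled on the markup shown in the question.
SAMPLE_HTML = """
<div class="sub_section_heading">
  <button class="sr_preset" type="button">Premier League</button>
  <button class="sr_preset" type="button">Europa League</button>
  <button class="sr_preset" type="button">EFL Cup</button>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS descendant selector: any <button> inside an element with this class.
buttons = soup.select(".sub_section_heading button")

# get_text(strip=True) trims surrounding whitespace from each label.
button_texts = [b.get_text(strip=True) for b in buttons]
print(button_texts)  # ['Premier League', 'Europa League', 'EFL Cup']
```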

button_texts will be a list containing only the button labels, but it may contain duplicates if there is more than one such div. To make the list unique:

distinct_texts = list(set(button_texts))
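
One caveat: a set does not keep insertion order, so the result may not come back in page order. If the order matters, as in the expected output in the question, a possible alternative is dict.fromkeys, which drops duplicates while keeping first occurrences. The input list below is made-up sample data.

```python
# Made-up sample: the same labels repeated, as if taken from two divs.
button_texts = [
    "Premier League", "Europa League",
    "Premier League", "EFL Cup", "Europa League",
]

# dict keys are unique and keep insertion order (Python 3.7+).
distinct_texts = list(dict.fromkeys(button_texts))
print(distinct_texts)  # ['Premier League', 'Europa League', 'EFL Cup']
```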

The full code is below.

import re
import requests
from bs4 import BeautifulSoup

default_url = 'https://fbref.com'
url = default_url + row['squad_href']  # `row` is defined elsewhere (not shown)
res = requests.get(url)
## The next two lines get around the issue with comments breaking the parsing.
comm = re.compile(r"<!--|-->")
soup = BeautifulSoup(comm.sub("", res.text), 'lxml')
info = soup.select(".sub_section_heading button")
button_texts = [x.text for x in info]
distinct_texts = list(set(button_texts))
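
Since the goal in the question is to parse each sub-table separately, one possible next step is to use each button's data-show attribute, which appears in the output above, to look up the element it points at. The sketch below assumes that data-show holds a plain "#id" reference and that the referenced element wraps a table; the sample HTML and its cell contents are hypothetical, so the real page may need adjusting.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: ids copied from the question, table contents invented.
SAMPLE_HTML = """
<div class="sub_section_heading">
  <button data-show="#all_stats_standard_ks_3232">Premier League</button>
  <button data-show="#all_stats_standard_ks_2901">Europa League</button>
</div>
<div id="all_stats_standard_ks_3232"><table><tr><td>PL row</td></tr></table></div>
<div id="all_stats_standard_ks_2901"><table><tr><td>EL row</td></tr></table></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

tables_by_name = {}
for button in soup.select(".sub_section_heading button[data-show]"):
    # Assumption: data-show is "#<id>", so strip the leading "#".
    target_id = button["data-show"].lstrip("#")
    container = soup.find(id=target_id)
    if container is not None:
        # Stores None if the container holds no <table>.
        tables_by_name[button.get_text(strip=True)] = container.find("table")

for name, table in tables_by_name.items():
    print(name, "->", table.get_text(strip=True) if table else None)
```

Each value in tables_by_name is a Tag for one sub-table, so it can be parsed on its own, keyed by the competition name from its button.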