I'm trying to extract some text with BS4 (BeautifulSoup 4). Sample HTML below.
</tr><tr id="_Gonzaga" class="seedrow">
<td title="Click to show/hide ranks" class='lowrowclick' style="text-align:center;font-size:8px">2</td>
<td id='Gonzaga' class="teamname"><a href="team.php?team=Gonzaga&year=2019" style="text-decoration: none;">Gonzaga<span class="lowrow" style="font-size:10px"><br/> 1 seed, <span style='background-color:#BAE2C6'>Elite Eight</span></span></a></td>
My current code:
data = soup.findAll('tr', attrs={"class": "seedrow"})
for item in data:
    team_name = item.find('td', class_='teamname')
    team_id = team_name.find('a').contents[0]
    seed = team_name.find('span').text
    print(team_id, seed)
This returns:
Gonzaga, '\xa0\xa0\xa01 seed, Elite Eight'
What I want:
Gonzaga, 1 seed, Elite Eight
Answer 0 (score: 0)
If I understand correctly, you want to extract three separate strings. You can call .get_text() with a custom separator= character and then split on that character:
from bs4 import BeautifulSoup
txt = '''
<tr id="_Gonzaga" class="seedrow">
<td title="Click to show/hide ranks" class='lowrowclick' style="text-align:center;font-size:8px">2</td>
<td id='Gonzaga' class="teamname"><a href="team.php?team=Gonzaga&year=2019" style="text-decoration: none;">Gonzaga<span class="lowrow" style="font-size:10px"><br/> 1 seed, <span style='background-color:#BAE2C6'>Elite Eight</span></span></a></td>
</tr>'''
soup = BeautifulSoup(txt, 'html.parser')
data = soup.findAll('tr', attrs={"class": "seedrow"})
for item in data:
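    team_name = item.find('td', class_='teamname')
    # strip=True trims each text node; '|' marks the boundaries between nodes
    a, b, c = team_name.get_text(strip=True, separator='|').split('|')
    print(a)
    print(b.strip(','))  # drop the trailing comma after "1 seed"
    print(c)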
Prints:
Gonzaga
1 seed
Elite Eight
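
If the goal is the single line shown in the question rather than three separate values, the leading '\xa0' (non-breaking space) characters can also be removed with plain string methods. The following is only a minimal sketch: normalize_ws is a hypothetical helper name, not part of BeautifulSoup, and the seed value is hard-coded from the output reported in the question.

```python
def normalize_ws(text):
    """Collapse every run of whitespace into a single space and trim the ends."""
    # str.split() with no arguments splits on any Unicode whitespace;
    # '\xa0' (NO-BREAK SPACE) counts as whitespace in Python 3.
    return " ".join(text.split())


team_id = "Gonzaga"
seed = "\xa0\xa0\xa01 seed, Elite Eight"  # value reported in the question

line = f"{team_id}, {normalize_ws(seed)}"
print(line)  # Gonzaga, 1 seed, Elite Eight
```

In the original loop, the same helper could be applied to team_name.find('span').text before printing.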