How do I scrape this HTML?
<h3>
<span class="method">GET </span>
[/r/
<em class="placeholder">subreddit</em>
]/api/user_flair
<span class="oauth-scope-list"><a href="https://github.com/reddit/reddit/wiki/OAuth2"><span class="api-badge oauth-scope">flair</span></a>
</span>
</h3>
Is there any way to get the text under the span tags? I know that next or next_sibling will give me the following text, but is there another approach, for example h3.span?
Answer 0 (score: 1)
This is how you can get the text:
from bs4 import BeautifulSoup
# Name the parser explicitly so bs4 does not warn or pick a different one per machine
soup = BeautifulSoup("""<h3>
<span class="method">GET </span>
[/r/
<em class="placeholder">subreddit</em>
]/api/user_flair
<span class="oauth-scope-list"><a href="https://github.com/reddit/reddit/wiki/OAuth2"><span class="api-badge oauth-scope">flair</span></a>
</span>
</h3>""", "html.parser")
# Find every span whose class attribute is exactly "api-badge oauth-scope"
api_badges = soup.find_all('span', {'class': 'api-badge oauth-scope'})
api_badges_txt = [api_badge.text for api_badge in api_badges]
print(api_badges_txt)
The output is
['flair']
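On the h3.span idea from the question: as a rough sketch, attribute-style access such as soup.h3.span returns only the first matching descendant, so it reaches the "GET " span but not the nested badge. A CSS selector passed to select() is one alternative for the badge. The variable names (html, method, badges) are illustrative, and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

html = """<h3>
<span class="method">GET </span>
[/r/
<em class="placeholder">subreddit</em>
]/api/user_flair
<span class="oauth-scope-list"><a href="https://github.com/reddit/reddit/wiki/OAuth2"><span class="api-badge oauth-scope">flair</span></a>
</span>
</h3>"""

soup = BeautifulSoup(html, "html.parser")

# Dot access returns only the FIRST <span> found inside <h3>
method = soup.h3.span.get_text(strip=True)

# CSS selector: any span under h3 that carries both classes, in any order
badges = [tag.get_text(strip=True)
          for tag in soup.select("h3 span.api-badge.oauth-scope")]

print(method)   # GET
print(badges)   # ['flair']
```

Unlike the dictionary form of find_all, the selector does not depend on the order in which the two classes appear in the class attribute.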
If you want the whole heading text instead, with a space put back before the badge text, use
add_space = soup.find('em').next_sibling.replace('\n', '').strip()
soup.find('h3').get_text(strip=True).replace(add_space, add_space + ' ')
you get 'GET[/r/subreddit]/api/user_flair flair'
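If the goal is to keep the pieces apart rather than patch spaces into one string, a possible alternative is to iterate over stripped_strings, which yields each text fragment under a tag with surrounding whitespace removed. This is only a sketch: the names parts, method, endpoint and scope are made up for illustration, and the unpacking assumes the heading always has the method first and the scope last.

```python
from bs4 import BeautifulSoup

html = """<h3>
<span class="method">GET </span>
[/r/
<em class="placeholder">subreddit</em>
]/api/user_flair
<span class="oauth-scope-list"><a href="https://github.com/reddit/reddit/wiki/OAuth2"><span class="api-badge oauth-scope">flair</span></a>
</span>
</h3>"""

soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3")

# Each non-empty text fragment under <h3>, stripped of whitespace
parts = list(h3.stripped_strings)

# Assumption: first fragment is the HTTP method, last is the OAuth scope,
# and everything in between belongs to the endpoint path
method, *path_parts, scope = parts
endpoint = "".join(path_parts)

print(parts)                    # ['GET', '[/r/', 'subreddit', ']/api/user_flair', 'flair']
print(method, endpoint, scope)  # GET [/r/subreddit]/api/user_flair flair
```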