Scraping from an ambiguous tag using BeautifulSoup

Time: 2019-06-07 21:42:15

Tags: html web-scraping beautifulsoup

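I want to extract the college each individual attended. In the specific snippet below, which is what I got after running soup.find_all() to return all the tags, the college is Auburn. I understand that tags mark the type of component in an HTML document. So in this case, is the relevant tag the one that starts with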

<a href='../College..."? 

If so, how would I use BeautifulSoup to return the college name?

<img height="75" id="CollegeLogo" 
id="CollegeCommit" style="color: white; font-size: 22px; 
text-decoration: underline dotted">Auburn</a>
</div>

1 Answer:

Answer 0 (score: 0)

Use the id:

from bs4 import BeautifulSoup as bs

html = '''
<img height="75" id="ContentPlaceHolder1_img4yearCollegeLogo" 
onerror="this.style.display='none'" src="https://5409b91eba8a3c695263-e57580eaf7522c9542febdac7b28f14a.ssl.cf1.rackcdn.com/1566.png"/>
</p><div class="Five"></div>
<a href="../College/CollegeCommitments.aspx?Grad=2012&amp;college=1566" 
id="ContentPlaceHolder1_hl4yearCommit" style="color: white; font-size: 22px; 
text-decoration: underline dotted">Auburn</a>
</div>
'''
soup = bs(html, 'lxml')
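# '#<id>' is a CSS id selector; select_one returns the first matching tag
link = soup.select_one('#ContentPlaceHolder1_hl4yearCommit')
href = link['href']        # the link target
college = link.get_text()  # the college name: 'Auburn'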

This is less efficient, but you can also match on a substring of an attribute value using the *= (contains), ^= (starts with), or $= (ends with) operators. For example:

href = soup.select_one('[id$=yearCommit]')['href']
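
As a minimal sketch of the three operators side by side, the snippet below runs each one against a trimmed copy of the markup above and reads the link text, which is the college name the question asks for. The trimmed HTML, the selectors dict, and the use of html.parser (so that lxml does not need to be installed) are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the answer above (assumed for illustration).
html = '''
<div>
<a href="../College/CollegeCommitments.aspx?Grad=2012&amp;college=1566"
id="ContentPlaceHolder1_hl4yearCommit">Auburn</a>
</div>
'''

# 'html.parser' ships with Python; 'lxml' should behave the same on this input.
soup = BeautifulSoup(html, 'html.parser')

# Each selector matches on a substring of the id attribute.
# The values are quoted because one of them starts with a digit.
selectors = {
    'contains': '[id*="4yearCommit"]',
    'starts with': '[id^="ContentPlaceHolder1_hl"]',
    'ends with': '[id$="yearCommit"]',
}

names = {}
for label, selector in selectors.items():
    tag = soup.select_one(selector)
    # select_one returns None when nothing matches, so guard before reading.
    names[label] = tag.get_text(strip=True) if tag is not None else None

print(names)
```

All three selectors resolve to the same link here. On a full page a substring match can hit more than one element, so the shortest pattern that is still unique is usually the safer choice.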