I am completely new to Python and BeautifulSoup. I want to get the link from the href attribute. Unfortunately, the anchor also contains other, irrelevant data.
Any help is much appreciated.
<a href="/link-i-want/to-get.html">
<li class="cat-list-row1 clearfix">
<img align="left" alt="Do not need!" src="https://do.not/need/.jpg" style="margin-right: 20px;" width="40%"/>
<h3>
<p class="subline">Do not need</p> Do not need! </h3>
<span class="tag-body">
<p>Do not need</p>... </span>
<div style="clear:both;"></div>
</li>
</a>
Answer 0 (score: 3)
You can extract an attribute value with square brackets ([]).
For example, to extract the alt value of the img tag, use:
image_example = soup.find('img')
print(image_example['alt'])
Updated code:
from bs4 import BeautifulSoup
data = '''
<a href="/link-i-want/to-get.html">
<li class="cat-list-row1 clearfix">
<img align="left" alt="Do not need!" src="https://do.not/need/.jpg" style="margin-right: 20px;" width="40%"/>
<h3>
<p class="subline">Do not need</p> Do not need! </h3>
<span class="tag-body">
<p>Do not need</p>... </span>
<div style="clear:both;"></div>
</li>
</a> <a href="/link-i-want/to-get.html">
<li class="cat-list-row1 clearfix">
<img align="left" alt="Do not need!" src="https://do.not/need/.jpg" style="margin-right: 20px;" width="40%"/>
<h3>
<p class="subline">Do not need</p> Do not need! </h3>
<span class="tag-body">
<p>Do not need</p>... </span>
<div style="clear:both;"></div>
</li>
</a>
'''
soup = BeautifulSoup(data, 'html.parser')
url_address = soup.find('a')['href']
print(url_address)  # Output: /link-i-want/to-get.html
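Note that find returns only the first matching tag. If the page has several anchors and you want every link, a minimal sketch along the following lines should work. The shortened HTML and the two link paths are made up for illustration; find_all and the href=True filter are standard BeautifulSoup features.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup: two anchors with an href and one without
html = '''
<a href="/first-link.html"><li><h3>Do not need</h3></li></a>
<a href="/second-link.html"><li><h3>Do not need</h3></li></a>
<a name="no-href">Anchor without href</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only <a> tags that actually carry an href attribute
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)
```

The nested content (img, h3, span and so on) does not matter here, because the href is an attribute of the a tag itself and not part of its children.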
The general format is:
soup.find('<tag>')['<attribute-name>']
Alternatively, you can use .get(attr):
soup.find('<tag>').get('<attr>')
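The two forms differ when the attribute is missing: the bracket form raises a KeyError, while .get returns None (or a default you pass in). A small sketch, using a made-up second anchor without an href to show the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor deliberately has no href
html = '<a href="/link-i-want/to-get.html">first</a><a>second</a>'
soup = BeautifulSoup(html, 'html.parser')
first, second = soup.find_all('a')

with_brackets = first['href']           # attribute exists, so this works
with_get = second.get('href')           # attribute missing, so this is None
with_default = second.get('href', '#')  # fall back to a default value

try:
    second['href']                      # bracket access on a missing attribute
    raised = False
except KeyError:
    raised = True

print(with_brackets, with_get, with_default, raised)
```

Using .get is usually the safer choice when some tags on the page may lack the attribute.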
Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start