How to get only the link from <a href>, which includes li elements, using BeautifulSoup?

Date: 2018-04-12 08:46:28

Tags: python html beautifulsoup

I am completely new to Python and BeautifulSoup. I want to get the link from the href. Unfortunately, the anchor also contains other, irrelevant data.

Help is much appreciated.

<a href="/link-i-want/to-get.html">
<li class="cat-list-row1 clearfix">
<img align="left" alt="Do not need!" src="https://do.not/need/.jpg" style="margin-right: 20px;" width="40%"/>
<h3>
<p class="subline">Do not need</p>	Do not need!				</h3>
<span class="tag-body">
<p>Do not need</p>...				</span>
<div style="clear:both;"></div>
</li>
</a>

1 answer:

Answer 0 (score: 3)

You can extract an attribute's value with square brackets [].

For example, to extract the alt attribute of an img tag, use image_example = soup.find('img') and then print(image_example['alt']).
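As a runnable illustration of the bracket lookup just described, here is a minimal sketch. It assumes the markup from the question, trimmed to the relevant tags; the variable names html and alt_text are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, trimmed to the relevant tags.
html = '''
<a href="/link-i-want/to-get.html">
<li class="cat-list-row1 clearfix">
<img align="left" alt="Do not need!" src="https://do.not/need/.jpg" width="40%"/>
</li>
</a>
'''

soup = BeautifulSoup(html, 'html.parser')
image_example = soup.find('img')  # first <img> tag in the document
alt_text = image_example['alt']   # attributes are read like dictionary keys
print(alt_text)
```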

Updated code:

from bs4 import BeautifulSoup

data = '''
    <a href="/link-i-want/to-get.html">
    <li class="cat-list-row1 clearfix">
    <img align="left" alt="Do not need!" src="https://do.not/need/.jpg" style="margin-right: 20px;" width="40%"/>
    <h3>
    <p class="subline">Do not need</p>  Do not need!                </h3>
    <span class="tag-body">
    <p>Do not need</p>...               </span>
    <div style="clear:both;"></div>
    </li>
    </a>    <a href="/link-i-want/to-get.html">
    <li class="cat-list-row1 clearfix">
    <img align="left" alt="Do not need!" src="https://do.not/need/.jpg" style="margin-right: 20px;" width="40%"/>
    <h3>
    <p class="subline">Do not need</p>  Do not need!                </h3>
    <span class="tag-body">
    <p>Do not need</p>...               </span>
    <div style="clear:both;"></div>
    </li>
    </a>
'''
soup = BeautifulSoup(data, 'html.parser')
# find() returns the first matching tag; ['href'] reads its attribute.
url_address = soup.find('a')['href']
print(url_address)  # Output: /link-i-want/to-get.html
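Note that find() only returns the first matching tag. If the page holds several such anchors, a sketch along these lines could collect every href with find_all(). The second URL and the anchor without an href are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links plus one anchor that has no href.
html = '''
<a href="/link-i-want/to-get.html"><li>first</li></a>
<a href="/another/link.html"><li>second</li></a>
<a name="no-href-here">anchor without a link</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only <a> tags that actually carry an href attribute,
# so the bracket lookup below cannot raise KeyError.
all_links = [a['href'] for a in soup.find_all('a', href=True)]
print(all_links)
```

Filtering with href=True is one way to skip anchors that have no link; checking each tag inside the loop would work as well.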

The general form is: soup.find('<tag>')['<attribute-name>']

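Alternatively, use the .get(attr) method, which returns None instead of raising KeyError when the attribute is missing: soup.find('<tag>').get('<attr>')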
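The difference between the two forms shows up when the attribute is absent. A small sketch, assuming a single anchor with no title attribute (the title lookup is just an example of a missing attribute):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/link-i-want/to-get.html">text</a>', 'html.parser')
anchor = soup.find('a')

href = anchor.get('href')           # attribute present: returns its value
missing = anchor.get('title')       # attribute absent: returns None
fallback = anchor.get('title', '')  # optional default, as with dict.get

# The bracket form raises KeyError for an absent attribute.
try:
    anchor['title']
    raised = False
except KeyError:
    raised = True

print(href, missing, repr(fallback), raised)
```

.get() tends to suit optional attributes, while brackets make a missing attribute fail loudly.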

Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start