How to get the href from an a tag inside a div

Asked: 2015-09-08 23:36:13

Tags: python tags beautifulsoup

I am using BeautifulSoup, and somehow I cannot extract the href from an a tag; whatever I try gives me an error. This is the function I am using:

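import requests
from bs4 import BeautifulSoup

def scrape_a(url):
    r = requests.get(url)
    # name the parser explicitly so results do not depend on what is installed
    soup = BeautifulSoup(r.content, "html.parser")
    # find_all returns a list of matching <div> tags, not the links inside them
    news = soup.find_all("div", attrs={"class": "news"})
    return news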

The HTML structure is:

<div class="news">
<a href="www.link.com">
<h2 class="heading">
Kenyan police foil potential bomb attack in Nairobi mall 
</h2>
<div class="teaserImg">
<img alt="" border="0" height="124" src="/image">
</div>
<p> text </p>
</a>
</div>

What I want to extract is the href and the h2 with class='heading'. Whenever I try to get either of them, I get the error: 'NoneType' object has no attribute '__getitem__'

1 Answer:

Answer 0 (score: 0)

How about something like this?

from bs4 import BeautifulSoup

def get_news_class_hrefs(html):
    """
    Finds all urls pointed to by all links inside
    'news' class div elements
    """
    soup = BeautifulSoup(html, 'html.parser')
    links = [a['href'] for div in soup.find_all("div", attrs={"class": "news"}) for a in div.find_all('a')]
    return links

# example html copied from question
html="""<div class="news">
<a href="www.link.com">
<h2 class="heading">
Kenyan police foil potential bomb attack in Nairobi mall 
</h2>
<div class="teaserImg">
<img alt="" border="0" height="124" src="/image">
</div>
<p> text </p>
</a>
</div>"""

get_news_class_hrefs(html)
# Output:
# [u'www.link.com']
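
The question also asks for the h2 heading, which the function above does not return. The following is a minimal sketch of one way to collect both, assuming the structure shown in the question. The names get_news_links_and_headings and sample_html are made up for this illustration. The error in the question usually means that a lookup such as find() returned None and the result was then indexed, so the sketch checks for None before reading the heading.

```python
from bs4 import BeautifulSoup


def get_news_links_and_headings(html):
    """Return (href, heading_text) pairs for links inside 'news' divs.

    Hypothetical helper for illustration; heading_text is None when the
    link has no <h2 class="heading"> inside it.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.find_all("div", class_="news"):
        # href=True skips <a> tags that have no href attribute at all
        for a in div.find_all("a", href=True):
            h2 = a.find("h2", class_="heading")
            # find() returns None when nothing matches, so guard before use
            heading = h2.get_text(strip=True) if h2 is not None else None
            results.append((a["href"], heading))
    return results


# shortened version of the HTML from the question
sample_html = """<div class="news">
<a href="www.link.com">
<h2 class="heading">
Kenyan police foil potential bomb attack in Nairobi mall
</h2>
<p> text </p>
</a>
</div>"""

pairs = get_news_links_and_headings(sample_html)
print(pairs)
```

Each pair can then be unpacked in a loop, for example `for href, heading in pairs:`. Returning None for a missing heading, rather than raising, is only one possible design choice; skipping such links would work as well.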