I'm using BeautifulSoup, and for some reason I can't extract the href from inside an a tag; whatever I try gives me an error. This is the function I'm using:
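import requests
from bs4 import BeautifulSoup

def scrape_a(url):
    r = requests.get(url)
    # name the parser explicitly so results do not depend on what is installed
    soup = BeautifulSoup(r.content, "html.parser")
    news = soup.find_all("div", attrs={"class": "news"})
    return news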
The HTML structure is:
<div class="news">
  <a href="www.link.com">
    <h2 class="heading">
      Kenyan police foil potential bomb attack in Nairobi mall
    </h2>
    <div class="teaserImg">
      <img alt="" border="0" height="124" src="/image">
    </div>
    <p> text </p>
  </a>
</div>
What I want to extract from this is the href and the h2 with class='heading'. Whenever I try to get both, I get the error: 'NoneType' object has no attribute '__getitem__'
Answer 0 (score: 0)
How about something like this?
from bs4 import BeautifulSoup

def get_news_class_hrefs(html):
    """
    Finds all urls pointed to by all links inside
    'news' class div elements
    """
    soup = BeautifulSoup(html, 'html.parser')
    links = [
        a['href']
        for div in soup.find_all("div", attrs={"class": "news"})
        for a in div.find_all('a')
    ]
    return links
# example html copied from question
html = """<div class="news">
  <a href="www.link.com">
    <h2 class="heading">
      Kenyan police foil potential bomb attack in Nairobi mall
    </h2>
    <div class="teaserImg">
      <img alt="" border="0" height="124" src="/image">
    </div>
    <p> text </p>
  </a>
</div>"""
get_news_class_hrefs(html)
# Output:
# [u'www.link.com']
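The question also asks for the h2 heading, and the NoneType error usually means that a find() call matched nothing and returned None, which was then indexed (Python 3 reports the same mistake as "'NoneType' object is not subscriptable"). Here is a rough sketch of one way to collect both values while guarding against missing tags. The function name get_news_links_and_headings is made up for illustration, and the sample HTML is a trimmed copy of the markup from the question.

```python
from bs4 import BeautifulSoup


def get_news_links_and_headings(html):
    """Return (href, heading text) pairs for each link inside div.news.

    Hypothetical helper, not part of the original answer.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.find_all("div", class_="news"):
        # href=True skips <a> tags that have no href attribute
        for a in div.find_all("a", href=True):
            h2 = a.find("h2", class_="heading")
            # find() returns None when nothing matches, so check before using it
            heading = h2.get_text(strip=True) if h2 is not None else None
            results.append((a["href"], heading))
    return results


# trimmed copy of the example markup from the question
html = """<div class="news">
<a href="www.link.com">
<h2 class="heading">
Kenyan police foil potential bomb attack in Nairobi mall
</h2>
<p> text </p>
</a>
</div>"""

for href, heading in get_news_links_and_headings(html):
    print(href, heading)
```

A link without a matching h2 still appears in the result, with None in place of the heading, so the caller can decide how to handle it.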