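I'm trying to drill down into a div and get the src and href inside it using beautifulsoup4. I've read the documentation, watched tutorials, and searched Stack Overflow posts, but haven't found an answer. Here is the HTML: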
<div class="thumbnail thumb">
<h6 id="date">May 2, 2016</h6>
<img src="http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg" class="img-responsive post">
<div style="border-bottom: thin solid lightslategray; padding-bottom: 15px;"></div>
<div class="caption" id="cap">
<a href="/blog/just-filler/">
<h5 class="post-title" id="title">just filler</h5>
</a>
<p>
<a href="/blog/36/delete/" class="btn" role="button">delete</a>
<a href="/blog/just-filler/edit/" class="btn" role="button">edit</a>
</p>
</div>
</div>
I tried this:
entries = [{'text': div.text,
'href': div.get('div', {'class', 'thumbnail'}).a,
'src': div.get('src')
} for div in divs]
But it doesn't work.
I'm using this in my Django app. What is the correct syntax for scraping the href and src? The text works, but src and href don't.
Answer 0 (score: 2)
BeautifulSoup may have a smarter, built-in way to do this, but this seems to work:
from bs4 import BeautifulSoup as soup
html = """
<div class="thumbnail thumb">
<h6 id="date">May 2, 2016</h6>
<img src="http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg" class="img-responsive post">
<div style="border-bottom: thin solid lightslategray; padding-bottom: 15px;"></div>
<div class="caption" id="cap">
<a href="/blog/just-filler/">
<h5 class="post-title" id="title">just filler</h5>
</a>
<p>
<a href="/blog/36/delete/" class="btn" role="button">delete</a>
<a href="/blog/just-filler/edit/" class="btn" role="button">edit</a>
</p>
</div>
</div>
"""
parsed = soup(html, "html.parser")
divs = parsed.find_all("div")
entries = [{'text': div.text,
# list comprehensions return real lists on both Python 2 and Python 3
'src': [img.get("src") for img in div.find_all('img')],
'href': [a.get("href") for a in div.find_all('a')]
} for div in divs if "thumbnail" in div.get("class", [])]
print(entries)
Output (Python 2, hence the u'' prefixes):
[{'text': u'\nMay 2, 2016\n\n\n\n\njust filler\n\n\ndelete\nedit\n\n\n', 'href': [u'/blog/just-filler/', u'/blog/36/delete/', u'/blog/just-filler/edit/'], 'src': [u'http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg']}]
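As for the "smarter, built-in way": one option is BeautifulSoup's CSS selector support via `select()`. The following is only a sketch, assuming a reasonably recent bs4 install and a trimmed copy of the HTML from the question; `div.thumbnail` matches only the outer div, and `img[src]` / `a[href]` skip tags that lack the attribute.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: same structure).
html = """
<div class="thumbnail thumb">
  <img src="http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg" class="img-responsive post">
  <div class="caption" id="cap">
    <a href="/blog/just-filler/"><h5 class="post-title" id="title">just filler</h5></a>
    <p>
      <a href="/blog/36/delete/" class="btn" role="button">delete</a>
      <a href="/blog/just-filler/edit/" class="btn" role="button">edit</a>
    </p>
  </div>
</div>
"""

parsed = BeautifulSoup(html, "html.parser")

entries = [
    {
        # join the text fragments with single spaces, dropping blank ones
        "text": div.get_text(" ", strip=True),
        # only <img> tags that actually carry a src attribute
        "src": [img["src"] for img in div.select("img[src]")],
        # only <a> tags that actually carry an href attribute
        "href": [a["href"] for a in div.select("a[href]")],
    }
    for div in parsed.select("div.thumbnail")
]
print(entries)
```

Each entry still holds lists, so a template would have to pick an element (for example the first href) or loop over them.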
Answer 1 (score: 0)
This worked in my view:
entries = [{'text': div.text,
'href': div.find('a').get('href'),
'src': div.find('img').get('src')
} for div in divs]
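A caveat with the snippet above: `div.find(...)` returns `None` when a div has no matching child, so `.get(...)` would raise `AttributeError` for such a div. That also hints at why the attempt in the question failed: `Tag.get()` reads an attribute of the tag itself, while `find()` searches its descendants. Below is a None-safe sketch; `first_attr` is a hypothetical helper (not part of bs4), and it assumes `divs` comes from `find_all("div")` over a shortened version of the question's HTML.

```python
from bs4 import BeautifulSoup


def first_attr(tag, name, attr):
    """Return `attr` of the first <name> descendant of `tag`, or None if absent."""
    found = tag.find(name)
    return found.get(attr) if found is not None else None


# Shortened version of the question's markup (assumption: same nesting).
html = """
<div class="thumbnail thumb">
<img src="http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg" class="img-responsive post">
<div style="border-bottom: thin solid lightslategray; padding-bottom: 15px;"></div>
<div class="caption" id="cap">
<a href="/blog/just-filler/">just filler</a>
</div>
</div>
"""

divs = BeautifulSoup(html, "html.parser").find_all("div")

entries = [{'text': div.text,
            'href': first_attr(div, 'a', 'href'),
            'src': first_attr(div, 'img', 'src')
            } for div in divs]

for e in entries:
    print(e['href'], e['src'])
```

With this input the empty border div yields `None` for both keys instead of crashing, and the caption div yields an href but no src.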
And in my template:
{% for e in entries %}
<a href="{{url}}{{ e.href }}" class="thumbnail">{{ e.text }}</a><br>
{{e.href}}<br>