How do I drill down and get the src and href inside a div using beautifulsoup4?

Asked: 2016-05-05 17:00:29

Tags: html django python-3.x beautifulsoup

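I'm trying to drill down into a div and get the src and href inside it using beautifulsoup4. I've read the docs, watched tutorials, and searched Stack Overflow posts, but haven't found an answer. Here's the HTML: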

<div class="thumbnail thumb">
     <h6 id="date">May 2, 2016</h6>
         <img src="http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg" class="img-responsive post">

                <div style="border-bottom: thin solid lightslategray; padding-bottom: 15px;"></div>

                <div class="caption" id="cap">
                    <a href="/blog/just-filler/">
                        <h5 class="post-title" id="title">just filler</h5>
                    </a>

                    <p>
                        <a href="/blog/36/delete/" class="btn" role="button">delete</a>
                        <a href="/blog/just-filler/edit/" class="btn" role="button">edit</a>
                    </p>

                </div>
</div>

I tried this:

entries = [{'text': div.text,
          'href': div.get('div', {'class', 'thumbnail'}).a,
          'src': div.get('src')
          } for div in divs]

But it doesn't work.

I'm using this in my Django app. What is the correct syntax for scraping the href and src? The text works, but the src and href do not.

2 answers:

Answer 0 (score: 2)

BeautifulSoup may have a smarter, built-in way to do this, but this seems to work:

from bs4 import BeautifulSoup as soup

html = """
<div class="thumbnail thumb">
     <h6 id="date">May 2, 2016</h6>
         <img src="http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg" class="img-responsive post">

                <div style="border-bottom: thin solid lightslategray; padding-bottom: 15px;"></div>

                <div class="caption" id="cap">
                    <a href="/blog/just-filler/">
                        <h5 class="post-title" id="title">just filler</h5>
                    </a>

                    <p>
                        <a href="/blog/36/delete/" class="btn" role="button">delete</a>
                        <a href="/blog/just-filler/edit/" class="btn" role="button">edit</a>
                    </p>

                </div>
</div>
"""

parsed = soup(html, "html.parser")

divs = parsed.find_all("div")

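# List comprehensions instead of map(): under Python 3, map() returns a lazy
# iterator, which would print as <map object ...> rather than the values.
entries = [{'text': div.text,
            'src' : [img.get("src") for img in div.find_all('img')],
            'href': [a.get("href") for a in div.find_all('a')]
          } for div in divs if "thumbnail" in div.get("class", [])]

print(entries)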

Output (as printed by Python 2; Python 3 omits the `u` prefixes and may order the keys differently):

[{'text': u'\nMay 2, 2016\n\n\n\n\njust filler\n\n\ndelete\nedit\n\n\n', 'href': [u'/blog/just-filler/', u'/blog/36/delete/', u'/blog/just-filler/edit/'], 'src': [u'http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg']}]
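As for why the attempt in the question fails: `Tag.get` reads an attribute of the tag itself and does not search its descendants. The outer div has no `src` attribute, so `div.get('src')` returns `None`, and `div.get('div', {'class', 'thumbnail'})` simply returns its second argument (a set) as the default, which has no `.a`. The following is a minimal sketch, not the answerer's code, of a possibly tidier variant that lets `find_all` do the class filtering. The markup is a trimmed copy of the HTML from the question, and the names `thumbnails` and `entries` are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = """
<div class="thumbnail thumb">
  <h6 id="date">May 2, 2016</h6>
  <img src="http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg" class="img-responsive post">
  <div class="caption" id="cap">
    <a href="/blog/just-filler/"><h5 class="post-title" id="title">just filler</h5></a>
    <p>
      <a href="/blog/36/delete/" class="btn" role="button">delete</a>
      <a href="/blog/just-filler/edit/" class="btn" role="button">edit</a>
    </p>
  </div>
</div>
"""

parsed = BeautifulSoup(html, "html.parser")

# class_ matches any one class of a multi-class tag, so this finds
# <div class="thumbnail thumb"> without filtering the divs by hand.
thumbnails = parsed.find_all("div", class_="thumbnail")

entries = [
    {
        # Join the text fragments with single spaces, dropping blank ones.
        "text": div.get_text(" ", strip=True),
        # src=True / href=True keep only tags that actually carry the attribute.
        "src": [img["src"] for img in div.find_all("img", src=True)],
        "href": [a["href"] for a in div.find_all("a", href=True)],
    }
    for div in thumbnails
]

print(entries)
```

Each entry here holds lists, because one thumbnail div can contain several links. If only the first link is wanted, `div.find("a")` returns a single tag instead.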

Answer 1 (score: 0)

This worked. In my view:

entries = [{'text': div.text,
          'href': div.find('a').get('href'),
          'src': div.find('img').get('src')
          } for div in divs]

And in my template:

{% for e in entries %}
    <a href="{{url}}{{ e.href }}" class="thumbnail">{{ e.text }}</a><br>
{{e.href}}<br>
{% endfor %}
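One caveat with the view code in this answer: `find` returns `None` when a div contains no matching tag, so `div.find('a').get('href')` raises `AttributeError` for such a div (for example the empty border div in the question's markup, if `divs` holds every div on the page). Below is a hedged sketch of a guard against that. The helper `first_attr` is a hypothetical name introduced here, not part of BeautifulSoup, and the markup is a shortened stand-in for the question's HTML.

```python
from bs4 import BeautifulSoup


def first_attr(tag, name, attr):
    """Return `attr` of the first `name` descendant of `tag`, or None.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    found = tag.find(name)
    return found.get(attr) if found is not None else None


# Shortened stand-in for the markup in the question.
html = """
<div class="thumbnail thumb">
  <img src="http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg">
  <div style="padding-bottom: 15px;"></div>
  <div class="caption"><a href="/blog/just-filler/">just filler</a></div>
</div>
"""

# Deliberately unfiltered, so some divs lack an <a> or an <img>.
divs = BeautifulSoup(html, "html.parser").find_all("div")

entries = [
    {
        "text": div.get_text(" ", strip=True),
        "href": first_attr(div, "a", "href"),   # None when the div has no <a>
        "src": first_attr(div, "img", "src"),   # None when the div has no <img>
    }
    for div in divs
]

for entry in entries:
    print(entry)
```

With this shape the template can skip incomplete entries, for instance by wrapping the link in `{% if e.href %}`.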