Selecting multiple tag items with BeautifulSoup in Python

Time: 2018-11-24 09:14:22

Tags: python web-scraping beautifulsoup

I have the following HTML:

<html>
<body>
...
</article>
<article class="issue">
<div class="issue-nr">#39</div>
<div class="issue-date">
<time datetime="2018-04-29T07:30:02+01:00">Apr 29, 2018</time>
</div>
<div class="issue-title">
<h1>
<a href="/" rel="" target="" title="Title"><span class="subject">The... - #39</span>
<span class="description">
 –
Blah, Bleh, Blih ...
</span>
</a></h1>
</div>
</article>
<article class="issue">
<div class="issue-nr">#38</div>
<div class="issue-date">
<time datetime="2018-04-28T07:30:00+01:00">Apr 28, 2018</time>
</div>
<div class="issue-title">
<h1>
<a href="/" rel="" target="" title="Title"><span class="subject">The... - #38</span>
<span class="description">
 –
Blah, Bleh, Blih ...
</span>
</a></h1>
</div>
</article>
<article class="issue">
<div class="issue-nr">#37</div>
<div class="issue-date">
<time datetime="2018-04-27T07:30:02+01:00">Apr 27, 2018</time>
</div>
<div class="issue-title">
<h1>
<a href="/" rel="" target="" title="Title"><span class="subject">The... - #37</span>
<span class="description">
 –
Blah, Bleh, Blih ...
</span>
</a></h1>
</div>
</article>
...
</body>
</html>

I want to iterate over each article tag, which I already have working:

from requests import get
from bs4 import BeautifulSoup

response = get("https://example.com")


soup = BeautifulSoup(response.text, "html.parser")
issues = soup.find_all("article", {"class": "issue"})

for issue in issues:
    print(issue)

Now I want to select the span tag with class "description" from each article tag, but when I call issue.span, only the first span found is returned.

Any suggestions?

Thanks.

1 answer:

Answer 0 (score: 1)

Do you mean something like the following, using CSS selectors in combination? I use the descendant combinator to combine the selectors so that I get the span.description descendants of article.issue. Writing it this way means you only get descriptions where they exist, so no further test is needed.

from bs4 import BeautifulSoup

html = '''
<html>
<body>
...
</article>
<article class="issue">
<div class="issue-nr">#39</div>
<div class="issue-date">
<time datetime="2018-04-29T07:30:02+01:00">Apr 29, 2018</time>
</div>
<div class="issue-title">
<h1>
<a href="/" rel="" target="" title="Title"><span class="subject">The... - #39</span>
<span class="description">
 –
Blah, Bleh, Blih ...
</span>
</a></h1>
</div>
</article>
<article class="issue">
<div class="issue-nr">#38</div>
<div class="issue-date">
<time datetime="2018-04-28T07:30:00+01:00">Apr 28, 2018</time>
</div>
<div class="issue-title">
<h1>
<a href="/" rel="" target="" title="Title"><span class="subject">The... - #38</span>
<span class="description">
 –
Blah, Bleh, Blih ...
</span>
</a></h1>
</div>
</article>
<article class="issue">
<div class="issue-nr">#37</div>
<div class="issue-date">
<time datetime="2018-04-27T07:30:02+01:00">Apr 27, 2018</time>
</div>
<div class="issue-title">
<h1>
<a href="/" rel="" target="" title="Title"><span class="subject">The... - #37</span>
<span class="description">
 –
Blah, Bleh, Blih ...
</span>
</a></h1>
</div>
</article>
...
</body>
</html>
'''

soup = BeautifulSoup(html, "lxml")
# Descendant combinator: every span.description anywhere inside an article.issue
descriptions = soup.select('article.issue span.description')
# Keep only the text of each matched span
descriptions = [description.text for description in descriptions]
print(descriptions)

Result: [screenshot of the printed list of descriptions]

In your case, you would need to select span.description from issue.
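Applied to the loop from the question, selecting span.description from each issue could look like the sketch below. It is only an illustration under a few assumptions: the HTML is a trimmed stand-in for the page in the question (the second article deliberately has no description), the helper name descriptions_per_issue is made up for this example, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the question's HTML; the second article has no
# description span so the missing case is visible.
html = """
<article class="issue">
<div class="issue-nr">#39</div>
<h1><a href="/"><span class="subject">The... - #39</span>
<span class="description"> Blah, Bleh, Blih ... </span></a></h1>
</article>
<article class="issue">
<div class="issue-nr">#38</div>
<h1><a href="/"><span class="subject">The... - #38</span></a></h1>
</article>
"""

soup = BeautifulSoup(html, "html.parser")


def descriptions_per_issue(soup):
    """Return one entry per article.issue: the description text, or None."""
    results = []
    for issue in soup.find_all("article", {"class": "issue"}):
        # issue.span returns the first span in the article (class "subject");
        # select_one picks the span by its class instead.
        span = issue.select_one("span.description")
        results.append(span.get_text(strip=True) if span is not None else None)
    return results


print(descriptions_per_issue(soup))
# prints ['Blah, Bleh, Blih ...', None]
```

Unlike the single combined selector, this keeps one entry per article, so a missing description shows up as None instead of being skipped. issue.find("span", class_="description") should work the same way if you prefer find over CSS selectors.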