Question

I want to scrape items from a website with BeautifulSoup.

<div class="post item">

The target tag looks like this. Its class attribute has two values separated by a space.

First, I wrote:

roots = soup.find_all("div", "post item")

But it did not work. Then I wrote:

html.find_all("div", {'class':['post', 'item']})

With this I can get the items, but I am not sure it is right. Is this code correct?

//// Addition ////

Sorry,

html.find_all("div", {'class':['post', 'item']})

does not work. It also extracts class="item".

Also, I have to write:

soup.find_all("div", class_="post item")

It is class_=, not class=. Though this does not work for me... (>_<)

Target URL:

https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb

My code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup

def main():
    target = "https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb"
    html = urlopen(target)
    soup = BeautifulSoup(html, "html.parser")
    roots = soup.find_all("div", class_="post item")
    print(roots)
    for root in roots:
        print("##################")


if __name__ == '__main__':
    main()

Answer 1

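You can use a CSS selector:

soup.select("div.post.item")

Or use class_:

soup.find_all("div", class_="post item")

The docs suggest that if you want to search for tags that match two or more CSS classes, you should use a CSS selector, as in the first example. They give an example of both use cases:

You can also search for the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]

The reason your code fails, and the reason any of the solutions above would fail too, has more to do with the fact that the class does not exist in the page source. If it were there, they would all work.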
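One rough way to check whether a class is present in the HTML that was actually downloaded is to count selector matches. The sketch below is only an illustration: `raw_html`, `rendered_html` and `count_matches` are made-up names, and the two strings are hypothetical stand-ins for "what the server sends" and "what the browser shows after JavaScript runs", not the real Flipboard markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins (assumptions, not the real page):
# raw_html      ~ what urlopen() would receive from the server
# rendered_html ~ what the browser shows after scripts have run
raw_html = '<html><body><div id="root"></div></body></html>'
rendered_html = (
    '<html><body><div id="root">'
    '<div class="post item">news</div>'
    '</div></body></html>'
)


def count_matches(html, selector="div.post.item"):
    """Return how many elements in html match the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector))


print(count_matches(raw_html))       # 0
print(count_matches(rendered_html))  # 1
```

If the count is 0 for the downloaded source, no variation of find_all or select will find the tag, because it is simply not in the document being parsed.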

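If you look at the page source in your browser and search for it, you will not find it there either. It is generated dynamically and can only be seen if you open the developer console or Firebug. The elements also contain only some styling and React IDs, so I am not sure what you expect to pull from them even if you did get them.

If you want to get the HTML that you see in your browser, you will need something like selenium.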

Answer 2

First of all, note that class is a very special multi-valued attribute, and it is a common source of confusion in BeautifulSoup.

html.find_all("div", {'class':['post', 'item']})

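This finds all div elements that have either the post class or the item class (or both, of course). That may produce extra results you do not want to see, assuming you are after div elements whose value is strictly class="post item". If that is the case, you can use a CSS selector:

html.select('div[class="post item"]')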
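To see the difference between these lookups side by side, here is a minimal sketch on a made-up HTML snippet. The sample markup and the variable names are assumptions for illustration only and are not taken from the target page.

```python
from bs4 import BeautifulSoup

# Made-up sample markup covering the interesting cases
sample = """
<div class="post item">both</div>
<div class="item">item only</div>
<div class="post">post only</div>
<div class="item post">reversed order</div>
"""
soup = BeautifulSoup(sample, "html.parser")

# A list of classes matches a tag that has ANY of them
either = soup.find_all("div", {"class": ["post", "item"]})

# A single string with a space matches only that exact attribute value
exact_string = soup.find_all("div", class_="post item")

# A CSS class selector requires both classes, in any order
both_classes = soup.select("div.post.item")

# A CSS attribute selector compares the whole attribute value
exact_attr = soup.select('div[class="post item"]')

for name, result in [
    ("either", either),
    ("exact_string", exact_string),
    ("both_classes", both_classes),
    ("exact_attr", exact_attr),
]:
    print(name, [tag.get_text() for tag in result])
```

On this sample the list form returns all four divs, the class selector returns the two divs that carry both classes, and the two exact forms return only the first div.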

There is also some more information in a similar thread:

BeautifulSoup returns empty list when searching by compound class names

Is this the right way to get items from a tag with 2 class attributes with BeautifulSoup?
