Question

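I want to use the findAll() function of the Beautiful Soup Python library to find several elements in an HTML document. These elements must match several criteria, but independently of one another.

For example, say my document looks like this: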

<div class="my_class">
    <span class="not_cool">
        <p name="p_1">A</p>
        <p name="p_2">B</p>
    </span>
    <span class="cool">
        <p name="p_3">C</p>
    </span>
</div>

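I want to find every span with class="cool" and every p with name="p_1" (there is only one of each here, but imagine that is not the case).

Individually, I would do:

.findAll("span",attrs={"class":"cool"})
.findAll("p",attrs={"name":"p_1"})

In a perfect world, I would like to do:

.findAll([ ["span",attrs={"class":"cool"}], ["p",attrs={"name":"p_1"}] ])

But of course, it doesn't work like that.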

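In fact, I am trying to write a function that converts HTML to BBCode (I don't want to, and can't, use an existing one). So I only need to keep the tags I am interested in.

However, I also need to know the order of these elements. If I use two separate .findAll() calls, I won't know which element came before and which came after.

Does anyone have a solution?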

Answer 1

You'll have to search with a function:

.find_all(lambda t: (t.name == 'span' and 'cool' in t.get('class', [])) or
                    (t.name == 'p' and t.get('name') == 'p_1'))

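The callable is passed every tag object in the tree; if it returns True, that tag is included in the result. The lambda above tests whether the tag name matches and whether the specific attribute is present. The class attribute is special: when it is present, it is always parsed as a list.

Note that in BeautifulSoup 4 the camelCase function names are deprecated; the lower_case_with_underscore names are the canonical methods. If you are still using BeautifulSoup 3, you may want to upgrade. Version 3 has now gone more than 2 years without an update.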
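As a minimal runnable sketch of this function-based search, assuming BeautifulSoup 4 with the built-in "html.parser" backend and the sample markup from the question: the helper name `wanted` and the variables `HTML`, `matches` and `summary` are illustrative only, not part of the library.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """
<div class="my_class">
    <span class="not_cool">
        <p name="p_1">A</p>
        <p name="p_2">B</p>
    </span>
    <span class="cool">
        <p name="p_3">C</p>
    </span>
</div>
"""


def wanted(tag):
    """Return True for <span class="cool"> or <p name="p_1"> (hypothetical helper)."""
    if tag.name == "span":
        # class is parsed as a list; default to [] when the attribute is absent
        return "cool" in tag.get("class", [])
    if tag.name == "p":
        # tag.get("name") reads the HTML attribute, tag.name is the tag name
        return tag.get("name") == "p_1"
    return False


soup = BeautifulSoup(HTML, "html.parser")

# A single pass over the tree, so matches come back in document order.
matches = soup.find_all(wanted)
summary = [(tag.name, tag.get_text(strip=True)) for tag in matches]
```

Because both criteria are checked in one traversal, the p element from the first span appears before the cool span in `matches`, which is the ordering information the question asks for.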

Answer 2

Iterate over all of the desired spans and find the matching children of each one.

spans = soup.findAll("span",attrs={"class":"cool"})
for span in spans:
    ps = span.findAll("p",attrs={"name":"p_1"})
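A minimal runnable sketch of this nested approach, assuming BeautifulSoup 4 (hence find_all rather than findAll), the "html.parser" backend, and the sample markup from the question. The variable names `cool_spans` and `nested` are illustrative. Note that the inner search only looks inside each cool span, so it finds a p with name="p_1" only when that p is a descendant of such a span.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """
<div class="my_class">
    <span class="not_cool">
        <p name="p_1">A</p>
        <p name="p_2">B</p>
    </span>
    <span class="cool">
        <p name="p_3">C</p>
    </span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Outer search: every <span class="cool"> in the document.
cool_spans = soup.find_all("span", attrs={"class": "cool"})

# Inner search: <p name="p_1"> elements nested inside each of those spans.
nested = []
for span in cool_spans:
    nested.extend(span.find_all("p", attrs={"name": "p_1"}))
```

On the question's sample, `cool_spans` holds one span and `nested` stays empty, because the p named p_1 sits inside the not_cool span.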

Finding different elements by different criteria with BeautifulSoup in Python
