Question

Possible duplicate:
Beautiful Soup cannot find a CSS class if the object has other classes, too

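I am using BeautifulSoup to find tables in HTML. The problem I am running into is a class attribute that contains a space. If my HTML reads <html><table class="wikitable sortable">blah</table></html>, I can't seem to extract it with the following (which I expected to find tables whose class is either wikitable or wikitable sortable):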

BeautifulSoup(html).findAll(attrs={'class':re.compile("wikitable( sortable)?")})

If my HTML is just <html><table class="wikitable">blah</table></html>, this does find the table. I also tried using "wikitable sortable" in my regular expression, but that did not match either. Any ideas?

Answer 1

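The pattern match will also fail if wikitable appears after another CSS class, as in class="something wikitable other". So if you want every table whose class attribute contains the class wikitable, you need a pattern that accepts more possibilities: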

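import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

html = '''<html><table class="sortable wikitable other">blah</table>
<table class="wikitable sortable">blah</table>
<table class="wikitable"><blah></table></html>'''

tree = BeautifulSoup(html)
# Match any class attribute that contains "wikitable" as a whole word
for node in tree.findAll(attrs={'class': re.compile(r".*\bwikitable\b.*")}):
    print node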

Result:

<table class="sortable wikitable other">blah</table>
<table class="wikitable sortable">blah</table>
<table class="wikitable"><blah></blah></table>
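The pattern can also be checked on its own against raw class attribute strings, without BeautifulSoup. The following is only a minimal sketch using the standard re module; the helper names has_class_regex and has_class_split and the sample strings are made up for illustration. It also shows one caveat: \b treats a hyphen as a word boundary, so a class such as wikitable-dark matches the regex, whereas splitting the attribute on whitespace compares whole class names.

```python
import re

# Same pattern as in the answer: "wikitable" as a whole word anywhere in the value.
PATTERN = re.compile(r".*\bwikitable\b.*")


def has_class_regex(class_attr):
    """Return True if the regex matches the raw class attribute string."""
    return PATTERN.match(class_attr) is not None


def has_class_split(class_attr, name="wikitable"):
    """Stricter check: split on whitespace and compare whole class names."""
    return name in class_attr.split()


# Hypothetical class attribute values, chosen to show where the two checks differ.
samples = [
    "sortable wikitable other",
    "wikitable sortable",
    "wikitable",
    "wikitable-dark",
    "mywikitable",
]

for value in samples:
    print("%-26s regex=%s split=%s" % (value, has_class_regex(value), has_class_split(value)))
```

If hyphenated class names occur in your pages, the whitespace-split check is probably the safer of the two.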

Just for the record, I don't use BeautifulSoup myself; like others have mentioned, I prefer lxml.

Answer 2

One of the things that makes lxml better than BeautifulSoup is its support for CSS-like class selection (or even full CSS selectors, if you want to use them):

import lxml.html

html = """<html>
<body>
<div class="bread butter"></div>
<div class="bread"></div>
</body>
</html>"""

tree = lxml.html.fromstring(html)

elements = tree.find_class("bread")

for element in elements:
    print lxml.html.tostring(element)

Gives:

<div class="bread butter"></div>
<div class="bread"></div>
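What makes class selection work here is that the class attribute is treated as a whitespace-separated list of names rather than as one string. As a rough illustration of that idea only (this is not how lxml is implemented), here is a sketch using the standard library's html.parser under Python 3; the ClassFinder class and the extra breadcrumbs element are hypothetical additions for the example.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collect (tag, class attribute) pairs for start tags carrying a given class name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value can be None.
        class_value = dict(attrs).get("class") or ""
        # Compare whole class names, not substrings of the attribute.
        if self.wanted in class_value.split():
            self.found.append((tag, class_value))


markup = """<html>
<body>
<div class="bread butter"></div>
<div class="bread"></div>
<div class="breadcrumbs"></div>
</body>
</html>"""

finder = ClassFinder("bread")
finder.feed(markup)

for tag, class_value in finder.found:
    print(tag, class_value)
```

The breadcrumbs element is skipped because the comparison is made per class name, which is the behaviour the regex-based approaches have to approximate.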

BeautifulSoup and searching by class
