Question

I'm using Beautiful Soup to pull out specific div tags, and it seems I can't use simple string matching.

The page has some tags of the form

<div class="comment form new"...>

which I want to ignore, and some tags of the form

<div class="comment comment-xxxx...">

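where each x stands for a digit (the number can be of any length) and the ellipsis stands for an arbitrary number of other whitespace-separated values that I don't care about. I can't figure out the correct regex, especially since I've never used Python's re module.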

Using

soup.find_all(class_="comment")

finds all of the tags whose class starts with the word comment. I have tried using

soup.find_all(class_=re.compile(r'(comment)( )(comment)'))
soup.find_all(class_=re.compile(r'comment comment.*'))

and many other variations, but I think I'm missing something obvious about how regexes or match() work here. Can anyone help me out?

Answer 1

I think I've got it:

>>> [div['class'] for div in soup.find_all('div')]
[['comment', 'form', 'new'], ['comment', 'comment-xxxx...']]

Note that, unlike the equivalent in BS3, the result is not this:

['comment form new', 'comment comment-xxxx...']

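That is why your regex doesn't match: class is split into a list of separate values, so none of the individual values contains the space your pattern requires.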
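To make that concrete, here is a minimal sketch that models value-by-value matching using only the re module. The helper any_token_matches is hypothetical and is not part of Beautiful Soup; it only imitates the behavior described here, and the library's actual matching logic may differ between versions.

```python
import re

# Class lists exactly as printed in the output above: one list per div.
class_lists = [
    ['comment', 'form', 'new'],
    ['comment', 'comment-xxxx...'],
]

# One of the patterns from the question; it requires a literal space.
spaced = re.compile(r'comment comment.*')


def any_token_matches(pattern, tokens):
    """Hypothetical helper: True if the pattern is found in any single class value."""
    return any(pattern.search(token) for token in tokens)


for tokens in class_lists:
    # No single value contains a space, so the spaced pattern never matches.
    print(tokens, any_token_matches(spaced, tokens))

# The same pattern does match the BS3-style joined string.
print(bool(spaced.search(' '.join(class_lists[1]))))

# A pattern that targets one value works on the list form.
print(any_token_matches(re.compile(r'comment-'), class_lists[1]))
```

The point of the sketch is only that a pattern written against the whole attribute string has to be rewritten to target a single class value.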

But you can match on a single class value, e.g.:

>>> soup.find_all('div', class_=re.compile('comment-'))
[<div class="comment comment-xxxx..."></div>]

Note that BS does the equivalent of re.search, not re.match, so you don't need 'comment-.*'. Of course, if you want to match 'comment-12345' but not 'comment-of-another-kind', you would use something like 'comment-\d+'.
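A quick way to check the difference between the two patterns, and between search and match, is to try them on plain strings with the re module. This is a small sketch, not Beautiful Soup code; the sample class values in tokens are made up for illustration.

```python
import re

# Made-up class values for illustration.
tokens = ['comment', 'comment-12345', 'comment-of-another-kind']

loose = re.compile(r'comment-')      # any value containing "comment-"
strict = re.compile(r'comment-\d+')  # "comment-" followed by at least one digit

loose_hits = [t for t in tokens if loose.search(t)]
strict_hits = [t for t in tokens if strict.search(t)]

print(loose_hits)   # ['comment-12345', 'comment-of-another-kind']
print(strict_hits)  # ['comment-12345']

# search() finds the pattern anywhere in the string; match() only at the start.
print(re.search(r'form', 'comment form new') is not None)  # True
print(re.match(r'form', 'comment form new') is not None)   # False
```

Because search looks anywhere in the value, 'comment-\d+' would also accept a value such as 'old-comment-12345'. If that matters, anchoring the pattern as r'^comment-\d+$' is one option.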