How to find HTML tags that contain specific text? - BeautifulSoup

Time: 2012-10-25 01:27:46

Tags: python regex beautifulsoup

Here is the source:

<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>

<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>

<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span> 

I want to find every <span class="new"> that contains "do something at". This is my code, and I just don't know why it doesn't work:

soup = bs4.BeautifulSoup(html, "lxml")
all_tags = soup.findAll(name = "span", attrs = {"class": "new"}, text = re.compile('do something.*'))

Nothing is found. If I remove text = re.compile('.*do something.*'), all of the tags above are found. I know something must be wrong with my regex pattern, so what is the correct form?

3 Answers:

Answer 0: (score: 1)

You can always try a hybrid approach:

soup = bs4.BeautifulSoup(html, "lxml")
spans = soup.findAll("span", attrs = {"class": "new"})
regex = re.compile('.*do something at.*')
desired_tags = [span for span in spans if regex.match(span.text)]
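
For reference, a self-contained version of this approach might look like the sketch below. It assumes the three spans from the question are held in a string named html, swaps in the built-in "html.parser" for "lxml" so that no extra parser is needed, and uses find_all and get_text(), the bs4 spellings of findAll and .text. As far as I can tell, filtering afterwards works where text= did not because text= is compared against a tag's .string, which is None for a tag with several children, as each of these spans has.

```python
import re

import bs4

# Sample markup copied from the question.
html = """
<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>
<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>
<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>
"""

# "html.parser" is an assumption made here to avoid the lxml dependency.
soup = bs4.BeautifulSoup(html, "html.parser")

# Step 1: select candidates by tag name and class only.
spans = soup.find_all("span", attrs={"class": "new"})

# Step 2: filter on the full text of each span, including nested tags.
pattern = re.compile(r"do something at")
desired_tags = [span for span in spans if pattern.search(span.get_text())]

for tag in desired_tags:
    # tag.a is the first <a> inside the span.
    print(tag.a["href"])
```

Using pattern.search rather than match means the pattern does not need a leading '.*' to skip the text before the phrase.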

Answer 1: (score: 0)

Loop over the contents of the HTML file and print the lines that match. Here I have replaced the file contents with a list l:

>>> l = ['<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>', 

'<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>',

'<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>' ]
>>> import re
>>> for line in l:
...     if re.search('<span class="new">.*do something.*', line):
...         print(line)
...
<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>
<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>
>>> 
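
The same line-by-line filter can also be written as a standalone script; the following is only a sketch. It assumes each span sits on a single line, as in the question, and the names lines, pattern, and matches are illustrative. A regular expression over raw HTML is fragile: it will miss a span that wraps across lines or whose attributes are written differently.

```python
import re

# Stand-in for the lines of the HTML file, copied from the question.
lines = [
    '<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>',
    '<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>',
    '<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>',
]

# Compile once; search() scans the whole line, so no trailing '.*' is needed.
pattern = re.compile(r'<span class="new">.*do something')

# Keep only the lines where the phrase follows the opening span tag.
matches = [line for line in lines if pattern.search(line)]

for line in matches:
    print(line)
```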

Answer 2: (score: 0)

This is how I usually find text.

spans = soup.findAll("span", attrs = {"class": "new"})
for s in spans:
    if "do something" in str(s):