python regex findall <span> </span>

Time: 2012-09-01 15:34:42

Tags: python regex

I want to find everything between <span class=""> and </span>:
p = re.compile('<span class=\"\">(.*?)\</span>', re.IGNORECASE)
text = re.findall(p, z)

For example, given <span class="">foo</span> I expect to get foo back, but it returns nothing! Why is my code failing?

Cheers

2 answers:

Answer 0: (score: 4)

HTML is not a regular language; you really should use an XML parser instead.

Python has several to choose from:
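One parser that ships with Python is the standard library's `html.parser`. The following is only a minimal sketch, assuming Python 3; the class name `EmptyClassSpanText` and the helper `span_texts` are made up for illustration and are not part of any library. It collects the text inside each outermost `<span class="">`, including text from spans nested inside it.

```python
from html.parser import HTMLParser


class EmptyClassSpanText(HTMLParser):
    """Collect the text inside each outermost <span class="">."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # nesting depth of <span> tags inside a target span
        self.parts = []    # text pieces of the span currently being read
        self.results = []  # finished text, one entry per target span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            # Already inside a target span: just track the nesting.
            self.depth += 1
        elif dict(attrs).get("class") == "":
            # A new outermost <span class=""> starts here.
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts))

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def span_texts(html):
    """Hypothetical helper: return the text of every <span class=""> in html."""
    parser = EmptyClassSpanText()
    parser.feed(html)
    parser.close()
    return parser.results


print(span_texts('<span class="">foo</span>'))
print(span_texts('<span class=""> a more\ncomplicated'
                 '<span class="other">other</span>foo</span>'))
```

Because the parser counts opening and closing `span` tags, the nested span does not cut the result short; its text is simply included in the outer span's text.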

Answer 1: (score: 2)

Your original code works as-is. That said, you should use an HTML parser.

import re
p = re.compile('<span class=\"\">(.*?)\</span>', re.IGNORECASE)
z = '<span class="">foo</span>'
text = re.findall(p, z)
print(text)

Output:

['foo']

Edit

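Tim pointed out that re.DOTALL should be used; without it, the pattern fails when the span's content contains a newline, as below: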

import re
p = re.compile('<span class="">(.*?)\</span>', re.IGNORECASE|re.DOTALL)
z = '''<span class=""> a more
complicated foo</span>'''
text = re.findall(p, z)
print(text)
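
To make the effect of re.DOTALL visible, here is a small side-by-side sketch, assuming Python 3. The variable names are illustrative, and the pattern is written as a raw string without the unnecessary backslash before `</span>`.

```python
import re

html = '<span class=""> a more\ncomplicated foo</span>'
pattern = r'<span class="">(.*?)</span>'

# Without DOTALL, "." does not match the newline, so nothing is found.
without_dotall = re.findall(pattern, html, re.IGNORECASE)

# With DOTALL, "." also matches newlines and the whole body is captured.
with_dotall = re.findall(pattern, html, re.IGNORECASE | re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # [' a more\ncomplicated foo']
```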

Even so, it fails on nested spans:

import re
p = re.compile('<span class="">(.*?)\</span>', re.IGNORECASE|re.DOTALL)
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
text = re.findall(p, z)
print(text)

Output (wrong):

[' a more\ncomplicated<span class="other">other']
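
The reason is that the non-greedy `(.*?)` stops at the first `</span>` it meets, which here belongs to the inner span. Switching to a greedy `(.*)` does not rescue the regex, because it then overshoots when two spans sit side by side. A short sketch, assuming Python 3 and using made-up sample strings:

```python
import re

nested = '<span class="">a<span class="other">b</span>c</span>'
siblings = '<span class="">a</span><span class="">b</span>'

lazy = re.compile(r'<span class="">(.*?)</span>', re.DOTALL)
greedy = re.compile(r'<span class="">(.*)</span>', re.DOTALL)

print(lazy.findall(nested))      # ['a<span class="other">b']
print(greedy.findall(nested))    # ['a<span class="other">b</span>c']
print(lazy.findall(siblings))    # ['a', 'b']
print(greedy.findall(siblings))  # ['a</span><span class="">b']
```

Neither variant is right for both inputs, since matching an opening tag to its own closing tag requires counting nesting depth.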

So use an HTML parser such as BeautifulSoup:

# BeautifulSoup 3 import, as in the original answer; BeautifulSoup 4 uses "from bs4 import BeautifulSoup"
from BeautifulSoup import BeautifulSoup
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
soup = BeautifulSoup(z)
# Every <span> whose class attribute is the empty string
print(soup.findAll('span', {'class': ''}))
print('')
# Every <span> whose class attribute is "other"
print(soup.findAll('span', {'class': 'other'}))

Output:

[<span class=""> a more
complicated<span class="other">other</span>foo</span>]

[<span class="other">other</span>]