Question

I want to find everything between <span class=""> and </span>:

p = re.compile('<span class=\"\">(.*?)\</span>', re.IGNORECASE)
text = re.findall(p, z)

For example, in this case <span class="">foo</span> should return foo, but it doesn't return anything! Why is my code wrong?

Cheers

Answer 1

Since HTML is not a regular language, you really should use an XML parser instead.

Python has several to choose from:

ElementTree is part of the standard library.
BeautifulSoup is a popular third-party library.
lxml is a fast, feature-rich, C-based library.
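
As a rough illustration of the standard-library option, here is a minimal sketch using ElementTree. It assumes the markup is well-formed XML (ElementTree rejects typical tag-soup HTML). The sample string and the span_texts helper are made up for this example and do not come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; it must be well-formed XML for ElementTree to parse it.
SAMPLE = '<div><span class="">foo</span><span class="other">bar</span></div>'


def span_texts(markup, css_class=""):
    """Return the text of every <span> whose class attribute equals css_class."""
    root = ET.fromstring(markup)
    return [
        "".join(el.itertext())  # itertext() also yields text of nested children
        for el in root.iter("span")
        if el.get("class") == css_class
    ]


print(span_texts(SAMPLE))  # ['foo']
```

For real-world HTML that is not well-formed, BeautifulSoup or lxml's HTML parser is likely the safer choice.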

Answer 2

Your original code works as is. Even so, you should use an HTML parser.

import re
# raw string: no backslash escapes are needed for the quotes or the "<"
p = re.compile(r'<span class="">(.*?)</span>', re.IGNORECASE)
z = '<span class="">foo</span>'
text = re.findall(p, z)
print(text)

Output:

['foo']

Edit

Tim pointed out that re.DOTALL should be used, otherwise the following fails:

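import re
# re.DOTALL lets "." match newlines, so the capture can span several lines
p = re.compile(r'<span class="">(.*?)</span>', re.IGNORECASE | re.DOTALL)
z = '''<span class=""> a more
complicated foo</span>'''
text = re.findall(p, z)
print(text)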
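
To make the effect of the flag concrete, the following sketch runs the same pattern with and without re.DOTALL on a span whose content contains a newline. The variable names are illustrative only.

```python
import re

PATTERN = r'<span class="">(.*?)</span>'
z = '<span class=""> a more\ncomplicated foo</span>'

# Without DOTALL, "." stops at the newline, so the closing tag is never reached.
without_dotall = re.findall(PATTERN, z, re.IGNORECASE)

# With DOTALL, "." also matches "\n" and the whole content is captured.
with_dotall = re.findall(PATTERN, z, re.IGNORECASE | re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # [' a more\ncomplicated foo']
```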

Even then, it fails for nested spans:

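import re
p = re.compile(r'<span class="">(.*?)</span>', re.IGNORECASE | re.DOTALL)
# the inner </span> ends the lazy match before the outer span closes
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
text = re.findall(p, z)
print(text)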

Output (fails):

[' a more\ncomplicated<span class="other">other']

So use an HTML parser like BeautifulSoup:

from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
soup = BeautifulSoup(z, 'html.parser')
# the parser tracks nesting, so each span is returned with its full content
print(soup.find_all('span', {'class': ''}))
print()
print(soup.find_all('span', {'class': 'other'}))

Output:

[<span class=""> a more
complicated<span class="other">other</span>foo</span>]

[<span class="other">other</span>]
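
If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser. This is a different technique from the BeautifulSoup call above: the hypothetical SpanCollector class below tracks nesting depth by hand and collects only the text of matching spans, not their inner markup.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect the text of <span> elements with a given class, nested spans included."""

    def __init__(self, css_class=""):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching span
        self.buffer = []    # text pieces of the span being collected
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            # a span nested inside a matching span: just go one level deeper
            self.depth += 1
        elif dict(attrs).get("class") == self.css_class:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # the matching span itself has closed
                self.results.append("".join(self.buffer))

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


collector = SpanCollector()
collector.feed('<span class=""> a more\ncomplicated<span class="other">other</span>foo</span>')
collector.close()
print(collector.results)  # [' a more\ncomplicatedotherfoo']
```

Unlike the regular expression, the depth counter means the inner </span> does not end the outer match.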

python regex findall <span> </span>
