I want to extract the text between <span class=""> and </span>.
p = re.compile('<span class=\"\">(.*?)\</span>', re.IGNORECASE)
text = re.findall(p, z)
For example, in this case: <span class="">foo</span>
I expect it to return foo, but it returns nothing!
Why is my code wrong?
Cheers
Answer 0 (score: 4)
Since HTML is not a regular language, you really should use an XML parser instead.
Python has several to choose from.
Answer 1 (score: 2)
Your original code works as written. Even so, you should use an HTML parser.
import re
p = re.compile('<span class=\"\">(.*?)\</span>', re.IGNORECASE)
z = '<span class="">foo</span>'
text = re.findall(p, z)
print text
Output:
['foo']
Edit
As Tim pointed out, re.DOTALL should be used; otherwise the following fails:
import re
p = re.compile('<span class="">(.*?)\</span>', re.IGNORECASE|re.DOTALL)
z = '''<span class=""> a more
complicated foo</span>'''
text = re.findall(p, z)
print text
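The effect of re.DOTALL is easy to check in isolation. The following is a minimal sketch in Python 3 syntax (the snippets in this answer use Python 2's print statement); the names pattern, without_dotall and with_dotall are made up for illustration.

```python
import re

# Same pattern as above, minus the unnecessary backslash before </span>
pattern = r'<span class="">(.*?)</span>'
z = '<span class=""> a more\ncomplicated foo</span>'

# Without re.DOTALL, "." does not match "\n", so the pattern cannot
# reach the closing tag on the second line.
without_dotall = re.findall(pattern, z, re.IGNORECASE)

# With re.DOTALL, "." also matches newlines.
with_dotall = re.findall(pattern, z, re.IGNORECASE | re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # [' a more\ncomplicated foo']
```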
Even then, nested spans fail:
import re
p = re.compile('<span class="">(.*?)\</span>', re.IGNORECASE|re.DOTALL)
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
text = re.findall(p, z)
print text
Output (fails):
[' a more\ncomplicated<span class="other">other']
So use an HTML parser such as BeautifulSoup:
from BeautifulSoup import BeautifulSoup
# Parse the same nested markup that defeated the regex
z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
soup = BeautifulSoup(z)
print soup.findAll('span',{'class':''})
print
print soup.findAll('span',{'class':'other'})
Output:
[<span class=""> a more
complicated<span class="other">other</span>foo</span>]
[<span class="other">other</span>]
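If installing a third-party package is not an option, roughly the same result can be had from the standard library. The sketch below is an alternative under stated assumptions, not the BeautifulSoup approach above: it uses Python 3's html.parser.HTMLParser, and the SpanTextCollector class and span_texts helper are hypothetical names introduced here. It tracks nesting depth so that an inner span does not end the match early, and it returns plain text rather than tag objects.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text inside every <span class=""> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # span nesting depth while inside a matching span
        self.parts = []     # text fragments of the span currently being read
        self.results = []   # finished text, one entry per matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth > 0:
            self.depth += 1                      # nested span inside a match
        elif dict(attrs).get("class") == "":
            self.depth = 1                       # outermost matching span opens
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:                  # outermost matching span closes
                self.results.append("".join(self.parts))

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def span_texts(html):
    """Return the text content of each <span class=""> in html."""
    parser = SpanTextCollector()
    parser.feed(html)
    parser.close()
    return parser.results


z = '''<span class=""> a more
complicated<span class="other">other</span>foo</span>'''
print(span_texts(z))  # [' a more\ncomplicatedotherfoo']
```

Unlike the regex, this keeps reading past the inner </span> because the closing tag only ends the match once the depth counter returns to zero.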