I have the following string:
s = '''
<a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
<a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''
I am trying to match the contents of the <span>s using the following:
import re
regex = re.compile(r'<a class="biz-name[\w\W]*<span>(.*)</span>')
regex.findall(s)
Expected:
['Gus’s World Famous Fried Chicken', 'South City Kitchen - Midtown']
Actual:
['South City Kitchen - Midtown']
Why does it only match the last occurrence?
Answer 0 (score: 1)
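You shouldn't parse XML with regex. That said, regex greediness is what is biting you: [\w\W]* matches any character, newlines included, so it runs past the first <span> and only gives back enough to reach the last one. The first name gets swallowed along the way.
Adding the non-greedy ? token ([\w\W]*?) fixes this, and adding one inside the group doesn't hurt either. I have also replaced [\w\W]*? with .*? because it is simpler and, for this input, equivalent: . does not match newlines, but each <span> sits on the same line as its <a>.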
regex = re.compile('<a class="biz-name.*?<span>(.*?)</span>')
See it on regex101.
Answer 1 (score: 1)
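To make the greedy versus non-greedy difference concrete, here is a minimal sketch that runs both patterns on the string from the question. The names `greedy`, `lazy`, `greedy_names` and `lazy_names` are illustrative and not from the original post.

```python
import re

s = '''
<a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
<a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''

# Greedy: [\w\W]* first consumes the rest of the string, then backtracks
# only as far as the LAST "<span>", so the whole input yields one match.
greedy = re.compile(r'<a class="biz-name[\w\W]*<span>(.*)</span>')

# Non-greedy: .*? stops at the first "<span>" it can reach, so each
# <a ...> element produces its own match.
lazy = re.compile(r'<a class="biz-name.*?<span>(.*?)</span>')

greedy_names = greedy.findall(s)
lazy_names = lazy.findall(s)

print(greedy_names)  # only the last name
print(lazy_names)    # both names
```

Note that the `.*?` version relies on each `<span>` being on the same line as its `<a>`. If the markup were wrapped across lines, passing `re.DOTALL` or going back to `[\w\W]*?` would likely be needed.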
Regex is usually not the best way to scrape HTML. One alternative, for example, is to use BeautifulSoup:
from bs4 import BeautifulSoup
s = '''
<a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
<a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''
s = BeautifulSoup(s, 'lxml')
results = [i.text for i in s.find_all('span')]
Output:
[u'Gus’s World Famous Fried Chicken', u'South City Kitchen - Midtown']
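If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's `html.parser`. This is only an illustration under assumptions: the class name `BizNameParser` is hypothetical, and unlike the `find_all('span')` call above it collects only text inside a `<span>` nested in an `<a>` whose class list contains `biz-name`.

```python
from html.parser import HTMLParser

s = '''
<a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
<a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''


class BizNameParser(HTMLParser):
    """Hypothetical helper: collects <span> text inside <a class="biz-name">."""

    def __init__(self):
        super().__init__()
        self.in_biz_link = False  # currently inside <a class="biz-name">
        self.in_span = False      # currently inside a <span> of such a link
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'a' and 'biz-name' in classes:
            self.in_biz_link = True
        elif tag == 'span' and self.in_biz_link:
            self.in_span = True
            self.names.append('')  # start a new name

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_span = False
        elif tag == 'a':
            self.in_biz_link = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current name.
        if self.in_span:
            self.names[-1] += data


parser = BizNameParser()
parser.feed(s)
parser.close()
names = parser.names
print(names)
```

The state flags are a deliberate simplification: they assume the links and spans are not nested inside one another, which holds for the markup in the question.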
However, here is a simple regex solution:
import re
s = '''
<a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
<a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''
final_results = re.findall('<span>(.*?)</span>', s)
Output:
['Gus’s World Famous Fried Chicken', 'South City Kitchen - Midtown']