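I have some HTML in which I want to find strings containing comma-separated numbers, such as
871,174 Views (the number can have 1 to n digit groups separated by commas).
I have tried many patterns, for example
'(\d+(,d+)*)\sViews'
but couldn't get it to work, because when I run
re.findall(r'(\d+(,d+)*)\sViews', string)
it gives
[('174', '')]
What I actually want is the whole number.
Edit 1: This is the string I am passing to the regex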
<span class="fcg"><span id="fbPhotoPageCreatorInfo"></span></span><div class="mbs fbPhotosAudienceContainerNotEditable" id="fbPhotoPageAudienceSelector"><span class="mrs fbPhotosAudienceNotEditable fsm fwn fcg">Shared with:</span><div class="_6a _29ee _3iio _20nn _43_1" data-hover="tooltip" aria-label="Public" data-tooltip-alignh="center"><i class="img sp_e0NUBoHLxu_ sx_9486cc"></i><span class="_29ef">Public</span></div> </div><div></div><span class="fcg">871,174 Views</span>
Answer 0 (score: 2)
Unless it is a typo, you left out a backslash before the second d:
'(\d+)(,\d+)*\sViews'
# here _^
Test
>>> html = """<span class="fcg">871,174 Views</span>"""
>>> import re
>>> pattern = re.compile(r'(\d+)(?:,(\d+))*\sViews')
>>> matches = re.findall(pattern, html)
>>> print(matches)
[('871', '174')]
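One caveat with the test above: a capturing group that repeats only keeps its last repetition, so a number with three or more digit groups loses the middle ones. A minimal sketch, assuming a made-up input of 1,871,174 Views (not from the question), comparing that pattern with one that wraps the whole number in a single capturing group:

```python
import re

# Hypothetical input with three digit groups, for illustration only.
html = '<span class="fcg">1,871,174 Views</span>'

# Repeated capturing group: group 2 only remembers its last repetition.
repeated = re.findall(r'(\d+)(?:,(\d+))*\sViews', html)
print(repeated)  # [('1', '174')]

# A single capturing group around the whole number keeps every digit group.
whole = re.findall(r'(\d+(?:,\d+)*)\sViews', html)
print(whole)  # ['1,871,174']
```

With only one capturing group in the pattern, findall returns plain strings instead of tuples, which is usually what you want here.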
Answer 1 (score: 0)
(\d+(?:,\d+)*)
Try this. It should work for you.
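Note that on its own this pattern matches every run of digits in the markup, including the ones inside class names. A small sketch, assuming a shortened version of the question's HTML, showing the difference once the \sViews suffix from the question is appended:

```python
import re

# Shortened, hypothetical version of the HTML from the question.
html = '<div class="_6a _29ee"></div><span class="fcg">871,174 Views</span>'

number = r'(\d+(?:,\d+)*)'

# Unanchored: digits inside class names match too.
everything = re.findall(number, html)
print(everything)  # ['6', '29', '871,174']

# Anchored on the trailing " Views": only the view count matches.
views_only = re.findall(number + r'\sViews', html)
print(views_only)  # ['871,174']
```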
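Answer 2 (score: 0)
If you don't want to use BeautifulSoup to extract the text and would rather search the whole string with re, then rsplit will be faster if speed is a concern: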
html = """<span class="fcg"><span id="fbPhotoPageCreatorInfo"></span></span><div class="mbs fbPhotosAudienceContainerNotEditable" id="fbPhotoPageAudienceSelector"><span class="mrs fbPhotosAudienceNotEditable fsm fwn fcg">Shared with:</span><div class="_6a _29ee _3iio _20nn _43_1" data-hover="tooltip" aria-label="Public" data-tooltip-alignh="center"><i class="img sp_e0NUBoHLxu_ sx_9486cc"></i><span class="_29ef">Public</span></div> </div><div></div><span class="fcg">871,174 Views</span>"""
import re
print(re.findall(r"\d+", html.rsplit('class="fcg">', 1)[1]))
['871', '174']
In [13]: timeit re.findall(("\d+"),html.rsplit('class="fcg">',1)[1])
100000 loops, best of 3: 3.21 µs per loop
In [14]: timeit matches = re.findall(pattern, html)
10000 loops, best of 3: 20.1 µs per loop
This is about as likely to break as any regex, which is why you should use BeautifulSoup.
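To illustrate the parser-based route, here is a rough sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup, so it runs without installing anything. The class TextCollector, the function view_counts, and the sample HTML are hypothetical names and data made up for this example. The idea is to run the regex only on text nodes, so attribute values and class names can never match:

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes only; tags and attribute values are ignored."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def view_counts(html):
    """Return every 'N Views' number found in the document's text nodes."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    counts = []
    for chunk in collector.chunks:
        counts.extend(re.findall(r'(\d+(?:,\d+)*)\sViews', chunk))
    return counts


# Hypothetical input: the attribute value "12 Views" must not be picked up.
html = ('<div class="_6a _29ee" aria-label="12 Views"></div>'
        '<span class="fcg">871,174 Views</span>')
print(view_counts(html))  # ['871,174']
```

With BeautifulSoup the same thing would presumably be shorter, since it can hand you the text of the span directly.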
Answer 3 (score: 0)
import re
html = """<span class="fcg"><span id="fbPhotoPageCreatorInfo"></span></span><div class="mbs fbPhotosAudienceContainerNotEditable" id="fbPhotoPageAudienceSelector"><span class="mrs fbPhotosAudienceNotEditable fsm fwn fcg">Shared with:</span><div class="_6a _29ee _3iio _20nn _43_1" data-hover="tooltip" aria-label="Public" data-tooltip-alignh="center"><i class="img sp_e0NUBoHLxu_ sx_9486cc"></i><span class="_29ef">Public</span></div> </div><div></div><span class="fcg">871,174 Views</span>"""
p = re.compile(r"[\d\,]+(?=\sViews)")
print(p.findall(html))  # ['871,174']
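If you need the count as an integer rather than a string, you can strip the commas from the match. A minimal sketch, assuming a hypothetical helper called parse_view_count; it uses \d[\d,]* instead of [\d,]+ so that a lone comma in front of " Views" cannot match and break int():

```python
import re

# Lookahead keeps " Views" out of the match; the leading \d rules out a bare comma.
VIEW_COUNT = re.compile(r'\d[\d,]*(?=\sViews)')


def parse_view_count(html):
    """Return the first view count in html as an int, or None if there is none."""
    match = VIEW_COUNT.search(html)
    if match is None:
        return None
    return int(match.group(0).replace(',', ''))


print(parse_view_count('<span class="fcg">871,174 Views</span>'))  # 871174
print(parse_view_count('<span class="fcg">Public</span>'))  # None
```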