How do I get the Nth match? (Regular expressions)

Time: 2014-11-29 13:51:16

Tags: python regex python-2.7

I have this HTML text:

<div>
     <div class="item1">  value 1 </div>
                \n
     <div class="item1">  value 2 </div>
               \n
     <div class="item1">  value 3 </div> 

</div>

There is unknown text between the div tags.

I want to get value 3.

I tried: re.findall(r'class="item1".*?{3}>(.*?)</div>',x,re.S)

But I get an invalid repeat error because I used {3}. How can I get only the third match?

1 answer:

Answer 0: (score: 2)

Using BeautifulSoup CSS selectors:

>>> from bs4 import BeautifulSoup
>>> s = """<div>
     <div class="item1">  value 1 </div>

     <div class="item1">  value 2 </div>

     <div class="item1">  value 3 </div> 

</div>"""
>>> soup = BeautifulSoup(s)
>>> soup
<html><body><div>
<div class="item1">  value 1 </div>
<div class="item1">  value 2 </div>
<div class="item1">  value 3 </div>
</div></body></html>
>>> [i.string for i in soup.select('div > div[class~=item1]')[-1]]
['  value 3 ']
>>> [i.string.strip() for i in soup.select('div > div[class~=item1]')[-1]]
['value 3']

As others have said, don't parse HTML files with regular expressions. If you still want a regex:

>>> import re
>>> re.findall(r'<div\s+class="item1">\s*(.*?)\s+</div>', s)[-1]
'value 3'
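
On the {3} error itself: a quantifier such as {3} repeats the token right before it, and Python's re rejects a quantifier stacked directly on another one like .*?. It also does not mean "give me the third match". Below is a minimal sketch of picking the n-th match by position instead, using only the standard library. The helper name nth_match and the sample text between the divs are made up for illustration.

```python
import re
from itertools import islice

# Sample input modelled on the question; the text between the divs is a placeholder.
html = """<div>
     <div class="item1">  value 1 </div>
     some unknown text
     <div class="item1">  value 2 </div>
     more unknown text
     <div class="item1">  value 3 </div>
</div>"""

# re.S lets "." cross newlines in case a value spans several lines.
pattern = re.compile(r'<div\s+class="item1">\s*(.*?)\s*</div>', re.S)


def nth_match(pattern, text, n):
    """Return group 1 of the n-th match (1-based), or None if there are fewer than n."""
    # finditer is lazy, so islice stops scanning once the n-th match is found.
    match = next(islice(pattern.finditer(text), n - 1, None), None)
    return match.group(1) if match else None


third = nth_match(pattern, html, 3)
print(third)
```

Indexing the list directly, as in re.findall(...)[2], also works, but it raises IndexError when there are fewer than three matches, whereas this helper returns None in that case.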