I have this HTML text:
<div>
<div class="item1"> value 1 </div>
\n
<div class="item1"> value 2 </div>
\n
<div class="item1"> value 3 </div>
</div>
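There is unknown text between the div tags.
I want to get value 3.
I tried: re.findall(r'class="item1".*?{3}>(.*?)</div>',x,re.S)
But I get an invalid-repeat error because I used {3}. How can I get only the third match?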
Answer 0: (score: 2)
Use BeautifulSoup CSS selectors.
>>> from bs4 import BeautifulSoup
>>> s = """<div>
<div class="item1"> value 1 </div>
<div class="item1"> value 2 </div>
<div class="item1"> value 3 </div>
</div>"""
>>> soup = BeautifulSoup(s)
>>> soup
<html><body><div>
<div class="item1"> value 1 </div>
<div class="item1"> value 2 </div>
<div class="item1"> value 3 </div>
</div></body></html>
>>> [i.string for i in soup.select('div > div[class~=item1]')[-1]]
[' value 3 ']
>>> [i.string.strip() for i in soup.select('div > div[class~=item1]')[-1]]
['value 3']
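As others have said, don't parse HTML with regular expressions. If you use one anyway, index the list of matches instead:
>>> import re
>>> re.findall(r'<div\s+class="item1">\s*(.*?)\s+</div>', s)[-1]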
'value 3'
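On why the original attempt fails: in `.*?{3}` the `{3}` quantifier is applied to `.*?`, which is already quantified, and Python's re module rejects that. Below is a rough sketch of two ways around it: indexing the findall() result, or wrapping the repeated part in a non-capturing group before counting. The names ITEM, nth_match and third_by_group are illustrative only, and the "unknown text" filler between the divs is an assumed stand-in for whatever sits there in the real page.

```python
import re

html = """<div>
<div class="item1"> value 1 </div>
unknown text
<div class="item1"> value 2 </div>
more unknown text
<div class="item1"> value 3 </div>
</div>"""

# The pattern from the question: a quantifier stacked on a quantifier.
BROKEN = r'class="item1".*?{3}>(.*?)</div>'
try:
    re.compile(BROKEN)
    broken_error = None
except re.error as exc:
    broken_error = exc  # the pattern fails to compile

# One capture per item; re.S lets "." cross newlines.
ITEM = re.compile(r'<div\s+class="item1">\s*(.*?)\s*</div>', re.S)


def nth_match(text, n):
    """Return the n-th (1-based) captured value, or None if there are fewer."""
    found = ITEM.findall(text)
    return found[n - 1] if 1 <= n <= len(found) else None


def third_by_group(text):
    """Skip two items with a quantified group, then capture the third."""
    pattern = (r'(?:<div\s+class="item1">.*?){2}'
               r'<div\s+class="item1">\s*(.*?)\s*</div>')
    m = re.search(pattern, text, re.S)
    return m.group(1) if m else None


print(nth_match(html, 3))
print(third_by_group(html))
```

Both calls print value 3 for this input. The caveat about regular expressions and HTML still applies: attribute order, extra classes, or nested divs would break either pattern.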