<p align="JUSTIFY"><a href="#abcd"> Mr A </a></p>
<p align="JUSTIFY">I </p>
<p align="JUSTIFY"> have a question </p>
<p align="JUSTIFY"> </p>
<p align="JUSTIFY"><a href="#mnop"> Mr B </a></p>
<p align="JUSTIFY">The </p>
<p align="JUSTIFY">answer is</p>
<p align="JUSTIFY">not there</p>
<p align="JUSTIFY"> </p>
<p align="JUSTIFY"><a href="wxyz"> Mr C </a></p>
<p align="JUSTIFY">Please</p>
<p align="JUSTIFY">Help</p>
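I would like to extract this data iteratively with BeautifulSoup.
If anyone knows how to do this, it would be helpful, because I ran into this problem while trying to learn Python. The code I tried is: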
for t in soup.findAll('p', text=re.compile(' '), attrs={'align': 'JUSTIFY'}):
    print(t)
    for item in t.parent.next_siblings:
        if isinstance(item, Tag):
            if 'p' in item.attrs and 'align' in item.attrs['p']:
                break
            print(item)
It returns [], which is not what I want.
Answer 0 (score: 3)
You can do this with BeautifulSoup:
from bs4 import BeautifulSoup

s = ""
html = '<p align="JUSTIFY">I </p>\
<p align="JUSTIFY"> have a question </p>\
<p align="JUSTIFY"> </p>\
<p align="JUSTIFY">The </p>\
<p align="JUSTIFY">answer is</p>\
<p align="JUSTIFY">not there</p>\
<p align="JUSTIFY"> </p>\
<p align="JUSTIFY">Please</p>\
<p align="JUSTIFY">Help</p>'
soup = BeautifulSoup(html, "html.parser")
title = soup.find_all("p", {"align": "JUSTIFY"})
for i in title:
    text = i.get_text().strip()
    # A blank paragraph separates two blocks; mark the boundary with a newline.
    s += (text + " ") if text else "\n"
f = s.split("\n")
for i in f:
    print(i.strip())
Answer 1 (score: 0)
Another approach, using regular expressions:
from re import split, sub

html = '<p align="JUSTIFY">I </p>\
<p align="JUSTIFY"> have a question </p>\
<p align="JUSTIFY"> </p>\
<p align="JUSTIFY">The </p>\
<p align="JUSTIFY">answer is</p>\
<p align="JUSTIFY">not there</p>\
<p align="JUSTIFY"> </p>\
<p align="JUSTIFY">Please</p>\
<p align="JUSTIFY">Help</p>'
# Split on the blank paragraphs, then strip the tags and collapse whitespace in each block.
blocks = split(r'<p align="JUSTIFY">\s*</p>', html)
print([sub(r"\s+", " ", sub(r"<.*?>", " ", x)).strip() for x in blocks])
Output:
['I have a question', 'The answer is not there', 'Please Help']
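Neither answer uses the speaker links (Mr A, Mr B, Mr C) from the question's HTML. The following is a minimal sketch of one way to keep them, assuming each speaker appears as an <a> element and everything up to the next <a> belongs to that speaker. It uses the standard library's html.parser instead of BeautifulSoup so it runs without extra packages; the names SpeakerParser and extract_by_speaker are made up for this illustration.

```python
from html.parser import HTMLParser


class SpeakerParser(HTMLParser):
    """Collect paragraph text under the most recent speaker link."""

    def __init__(self):
        super().__init__()
        self.groups = []      # list of (speaker, [text fragments])
        self.in_link = False  # True while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip the blank separator paragraphs
        if self.in_link:
            # Assumption: text inside a link is a speaker name and starts a new group.
            self.groups.append((text, []))
        elif self.groups:
            # Text seen before the first speaker is ignored.
            self.groups[-1][1].append(text)


def extract_by_speaker(html):
    """Return a list of (speaker, joined text) pairs in document order."""
    parser = SpeakerParser()
    parser.feed(html)
    parser.close()
    return [(name, " ".join(parts)) for name, parts in parser.groups]


html = (
    '<p align="JUSTIFY"><a href="#abcd"> Mr A </a></p>'
    '<p align="JUSTIFY">I </p>'
    '<p align="JUSTIFY"> have a question </p>'
    '<p align="JUSTIFY"> </p>'
    '<p align="JUSTIFY"><a href="#mnop"> Mr B </a></p>'
    '<p align="JUSTIFY">The </p>'
    '<p align="JUSTIFY">answer is</p>'
    '<p align="JUSTIFY">not there</p>'
    '<p align="JUSTIFY"> </p>'
    '<p align="JUSTIFY"><a href="wxyz"> Mr C </a></p>'
    '<p align="JUSTIFY">Please</p>'
    '<p align="JUSTIFY">Help</p>'
)

for name, text in extract_by_speaker(html):
    print(name + ": " + text)
```

With the sample HTML from the question, this prints one line per speaker, for example "Mr A: I have a question". Grouping on the links rather than on the blank paragraphs means a missing separator paragraph does not merge two speakers.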