How to extract all the data between &nbsp; paragraphs

Time: 2013-08-08 10:08:21

Tags: python python-2.7

<p align="JUSTIFY"><a href="#abcd"> Mr A </a></p>
<p align="JUSTIFY">I </p>
<p align="JUSTIFY"> have a question </p>
<p align="JUSTIFY">&nbsp;</p>
<p align="JUSTIFY"><a href="#mnop"> Mr B </a></p>
<p align="JUSTIFY">The </p>
<p align="JUSTIFY">answer is</p>
<p align="JUSTIFY">not there</p>
<p align="JUSTIFY">&nbsp;</p>
<p align="JUSTIFY"><a href="wxyz"> Mr C </a></p>
<p align="JUSTIFY">Please</p>
<p align="JUSTIFY">Help</p>

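I want to extract the data iteratively, using the &nbsp; paragraphs as separators.

  • The first iteration should show "I have a question"
  • The second iteration should show "The answer is not there"
  • The names should also be extracted into a separate list, e.g. ['Mr A', 'Mr B', 'Mr C']

If anyone knows how to do this, it would be a great help; I'm learning Python and got stuck on this problem. The code I tried is: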

for t in soup.findAll('p',text = re.compile('&nbsp;'), attrs = {'align' : 'JUSTIFY'}):
    print t
    for item in t.parent.next_siblings:
        if isinstance(item, Tag):
            if 'p' in item.attrs and 'align' in item.attrs['p']:
                break
            print item

It returns [], which is not what I want.

2 answers:

Answer 0 (score: 3)

You can do this with BeautifulSoup:

from bs4 import BeautifulSoup

s = ""

html = '<p align="JUSTIFY">I </p>\
<p align="JUSTIFY"> have a question </p>\
<p align="JUSTIFY">&nbsp;</p>\
<p align="JUSTIFY">The </p>\
<p align="JUSTIFY">answer is</p>\
<p align="JUSTIFY">not there</p>\
<p align="JUSTIFY">&nbsp;</p>\
<p align="JUSTIFY">Please</p>\
<p align="JUSTIFY">Help</p>'

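soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
title = soup.findAll("p", {"align" : "JUSTIFY"})

for i in title:
    # get_text() returns the paragraph text; the added space keeps words apart
    s += i.get_text() + " "

# bs4 decodes &nbsp; to the non-breaking space u"\xa0", so split on that character
f = s.split(u"\xa0")
for i in f:
    print(" ".join(i.split()))  # collapse runs of whitespace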
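
The question also asks for the names in a separate list, which the snippet above does not cover. A minimal sketch of one way to get both, assuming every name sits in a paragraph that wraps an <a> link and every block ends at a paragraph containing only &nbsp;. The function name split_blocks is made up for illustration, and the built-in "html.parser" is just one possible parser choice:

```python
from bs4 import BeautifulSoup

# The sample markup from the question.
html = """
<p align="JUSTIFY"><a href="#abcd"> Mr A </a></p>
<p align="JUSTIFY">I </p>
<p align="JUSTIFY"> have a question </p>
<p align="JUSTIFY">&nbsp;</p>
<p align="JUSTIFY"><a href="#mnop"> Mr B </a></p>
<p align="JUSTIFY">The </p>
<p align="JUSTIFY">answer is</p>
<p align="JUSTIFY">not there</p>
<p align="JUSTIFY">&nbsp;</p>
<p align="JUSTIFY"><a href="wxyz"> Mr C </a></p>
<p align="JUSTIFY">Please</p>
<p align="JUSTIFY">Help</p>
"""


def split_blocks(markup):
    """Hypothetical helper: return (names, messages) from the paragraphs."""
    soup = BeautifulSoup(markup, "html.parser")
    names, messages, words = [], [], []
    for p in soup.find_all("p", attrs={"align": "JUSTIFY"}):
        text = p.get_text()
        if p.find("a") is not None:
            # Assumption: a paragraph wrapping a link holds a name.
            names.append(text.strip())
        elif text.strip() == "":
            # bs4 decodes &nbsp; to "\xa0", which strip() removes, so a
            # blank paragraph marks the end of the current block.
            if words:
                messages.append(" ".join(words))
            words = []
        else:
            words.append(" ".join(text.split()))
    if words:
        # The last block has no trailing separator paragraph.
        messages.append(" ".join(words))
    return names, messages


names, messages = split_blocks(html)
print(names)
for message in messages:
    print(message)
```

Each entry of messages corresponds to one "iteration" in the question, and names holds the link texts in document order.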

Answer 1 (score: 0)

Another approach, using regular expressions:

from re import sub

html = '<p align="JUSTIFY">I </p>\
<p align="JUSTIFY"> have a question </p>\
<p align="JUSTIFY">&nbsp;</p>\
<p align="JUSTIFY">The </p>\
<p align="JUSTIFY">answer is</p>\
<p align="JUSTIFY">not there</p>\
<p align="JUSTIFY">&nbsp;</p>\
<p align="JUSTIFY">Please</p>\
<p align="JUSTIFY">Help</p>'

print([sub(r"\s+", " ", x).strip() for x in sub(r"<.*?>", " ", html).split("&nbsp;")])

Output:

['I have a question', 'The answer is not there', 'Please Help']
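
The same idea can be stretched to the full markup from the question, including the name links. This is only a sketch: it assumes each name paragraph sits on one line and wraps a single <a> element, and regular expressions remain fragile against arbitrary HTML. The variable names are illustrative.

```python
import re

# The sample markup from the question.
html = """
<p align="JUSTIFY"><a href="#abcd"> Mr A </a></p>
<p align="JUSTIFY">I </p>
<p align="JUSTIFY"> have a question </p>
<p align="JUSTIFY">&nbsp;</p>
<p align="JUSTIFY"><a href="#mnop"> Mr B </a></p>
<p align="JUSTIFY">The </p>
<p align="JUSTIFY">answer is</p>
<p align="JUSTIFY">not there</p>
<p align="JUSTIFY">&nbsp;</p>
<p align="JUSTIFY"><a href="wxyz"> Mr C </a></p>
<p align="JUSTIFY">Please</p>
<p align="JUSTIFY">Help</p>
"""

# Names are the text inside each <a>...</a> link.
names = [n.strip() for n in re.findall(r"<a\b[^>]*>(.*?)</a>", html)]

# Drop the name paragraphs first so they do not leak into the messages.
body = re.sub(r"<p\b[^>]*>\s*<a\b.*?</a>\s*</p>", " ", html)

# Strip the remaining tags, split on the literal &nbsp; entity,
# then collapse whitespace inside each piece.
messages = [re.sub(r"\s+", " ", x).strip()
            for x in re.sub(r"<.*?>", " ", body).split("&nbsp;")]

print(names)
print(messages)
```

Unlike the BeautifulSoup version, this works on the raw text, so the separator is still the literal string "&nbsp;" rather than the decoded character.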