Find everything between two tags in Python

Time: 2013-09-01 11:28:47

Tags: python python-2.7 beautifulsoup

<p>This is the first paragraph with some details</p>
<p><a href = "user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href = "user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user2</p></font>
!---- There are n blocks of data like this ----!

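This is the structure of my HTML. My goal is to extract each user together with their content, so in this case it should print everything between two 'a' tags. This is only a sample of my structure; in the real HTML there are different kinds of tags between two 'a' tags. I need a solution that iterates over all the tags following an 'a' tag until it finds the next 'a' tag. I hope this is clear.

The code I tried is: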

for i in soup.findAll('a'):
    while(i.nextSibling.name!='a'):
        print i.nextSibling

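This gives me an infinite loop. If anyone knows how to solve this, please share it with me.

The expected output is:

Username is: user1

text is: This is opening contents for user1 This is the contents from user1 This is more content from user1

Username is: user2

text is: This is opening contents for user2 This is the contents from user2 This is more content from user2

and so on...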

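2 answers:

Answer 0: (score: 1)

One approach is to search for every <a> tag with find_all() and, for each link, use find_all_next() to search for the <font> tags that hold that user's content, stopping at the next <a>. The following script extracts each user name and its content and saves them as tuples in a list: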

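from bs4 import BeautifulSoup

l = []

# 'htmlfile' is the path to the HTML document from the question
with open('htmlfile') as f:
    soup = BeautifulSoup(f, 'html.parser')
for link in soup.find_all('a'):
    s = []
    # Visit every later <font> or <a> tag in document order
    for elem in link.find_all_next(['font', 'a']):
        if elem.name == 'a':
            break  # the next user's link ends this user's content
        s.append(elem.get_text())  # unlike .string, get_text() never returns None
    user_content = ' '.join(s)
    l.append((link.string, user_content))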

It yields:

[('user1', 'This is opening contents for user1 This is the contents from user1 This is more content from user1'),
 ('user2', 'This is opening contents for user2 This is the contents from user2 This is more content from user2')]
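The question mentions that the real HTML has other kinds of tags between the links, so a variant that does not depend on <font> may be useful. The following is only a minimal sketch for Python 3, not the script from this answer: it walks link.next_elements and collects every non-empty text node until the next <a> appears. The helper name text_between_links and the inline HTML sample (adapted from the question, with the last line reading user2) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Sample adapted from the question (assumption: the last line belongs to user2)
HTML = """
<p>This is the first paragraph with some details</p>
<p><a href="user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href="user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user2</p></font>
"""


def text_between_links(html):
    """Hypothetical helper: return (link text, following text) per <a> tag."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a"):
        pieces = []
        # next_elements yields every tag and string after the link, in document order
        for node in link.next_elements:
            if isinstance(node, Tag) and node.name == "a":
                break  # reached the next user's link
            if isinstance(node, NavigableString):
                # Skip strings that are part of the link itself (the user name)
                if any(parent is link for parent in node.parents):
                    continue
                text = node.strip()
                if text:  # ignore whitespace-only strings between tags
                    pieces.append(text)
        results.append((link.get_text(strip=True), " ".join(pieces)))
    return results


for user, content in text_between_links(HTML):
    print("Username is:", user)
    print("text is:", content)
```

Because it collects text nodes rather than specific tags, this sketch should keep working when the content sits in <div>, <span> or other elements, as long as each user's block still starts with an <a> tag.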

Answer 1: (score: 0)

Try this:

from bs4 import BeautifulSoup

html="""
<p>This is the first paragraph with some details</p>
<p><a href="user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href="user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user2</p></font>
"""

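soup = BeautifulSoup(html, 'html.parser')
for i in soup.find_all('a'):
  print 'name:', i.text
  # Start with the tag after the link, then continue with the siblings of the link's parent <p>
  for s in [i.find_next_sibling(), i.parent.find_next_sibling()]:
    while s is not None:
      # Stop at the first block that contains the next user's link
      if s.find('a') is not None:
        break
      print 'contents:', s.text
      s = s.find_next_sibling()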

(Note: find_all is the recommended name for findAll and may not be available in older versions of Beautiful Soup. The same goes for find_next_sibling.)
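To illustrate the naming note, here is a small Python 3 sketch that calls both spellings on the same document. It assumes a bs4 release that still ships the old camelCase names findAll and findNextSibling as aliases; newer releases may emit a deprecation warning for them. The two-link HTML string is made up for the example.

```python
from bs4 import BeautifulSoup

# Tiny made-up document with two user links
soup = BeautifulSoup(
    '<p><a href="user123">user1</a></p><p><a href="user234">user2</a></p>',
    "html.parser",
)

# New-style (snake_case) and old-style (camelCase) lookups of the same tags
new_names = [a.get_text() for a in soup.find_all("a")]
old_names = [a.get_text() for a in soup.findAll("a")]

first_paragraph = soup.find("a").parent
new_sibling = first_paragraph.find_next_sibling()
old_sibling = first_paragraph.findNextSibling()

print(new_names == old_names)      # both spellings find the same links
print(new_sibling is old_sibling)  # both spellings return the same Tag object
```

For new code the snake_case names are the safer choice, since they are the ones the current documentation uses.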