Question

<p>This is the first paragraph with some details</p>
<p><a href = "user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href = "user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user1</p></font>
!----There is n number of data like this-----!

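This is the structure of my HTML. My goal is to extract each user and their content. In this case, it should print everything between two 'a' tags. This is just an example of my structure; in the real HTML there are different kinds of tags between two 'a' tags. I need a solution that iterates over all the tags following an 'a' tag until it finds another 'a' tag. I hope this is clear.

The code I tried is: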

for i in soup.findAll('a'):
    while(i.nextSibling.name!='a'):
        print i.nextSibling

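This gives me an infinite loop. If anyone knows how to solve this, please share.

The expected output is:

Username is: user1

text is: This is opening contents for user1 This is the contents from user1 This is more content from user1

Username is: user2

text is: This is opening contents for user2 This is the contents from user2 This is more content from user2

and so on...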

Answer 1

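One approach is to find every <a> tag with find_all() and, for each link, use find_all_next() to walk forward through the <font> tags that hold that user's content, stopping at the next <a>. The following script extracts each username and its content and saves them as tuples in a list: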

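from bs4 import BeautifulSoup

l = []

# 'htmlfile' is the path to the saved HTML; the parser is named explicitly
with open('htmlfile') as f:
    soup = BeautifulSoup(f, 'html.parser')

for link in soup.find_all('a'):
    s = []
    # Walk forward in document order over every later <font> or <a>
    for elem in link.find_all_next(['font', 'a']):
        if elem.name == 'a':
            break  # the next user's section starts here
        # get_text() never returns None, unlike .string on multi-child tags
        s.append(elem.get_text())
    user_content = ' '.join(s)
    l.append((link.get_text(), user_content))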

It yields:

[('user1', 'This is the contents from user1 This is more content from user1'),
 ('user2', 'This is the contents from user2 This is more content from user2')]
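
To try the same idea without a separate file, here is a minimal self-contained sketch. It assumes Python 3 with the bs4 package installed and the built-in html.parser backend. The helper name extract_users and the inline SAMPLE string are illustrative and not part of the original answer. It collects text with get_text(), which also works for tags that have several children.

```python
from bs4 import BeautifulSoup

# Inline copy of the sample markup from the question (illustrative).
SAMPLE = """
<p>This is the first paragraph with some details</p>
<p><a href="user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href="user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user1</p></font>
"""


def extract_users(html):
    """Return a list of (username, text) tuples; hypothetical helper."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a"):
        parts = []
        # Every later <font> or <a> in document order, nested or not.
        for elem in link.find_all_next(["font", "a"]):
            if elem.name == "a":
                break  # the next user's section starts here
            parts.append(elem.get_text(strip=True))
        results.append((link.get_text(strip=True), " ".join(parts)))
    return results


for name, text in extract_users(SAMPLE):
    print("Username is:", name)
    print("text is:", text)
```

One caveat of this approach: if <font> tags are nested inside one another, the inner text is collected twice, once for each matching tag.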

Answer 2

Try this:

from bs4 import BeautifulSoup

html="""
<p>This is the first paragraph with some details</p>
<p><a href="user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href="user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user1</p></font>
"""

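soup = BeautifulSoup(html, 'html.parser')
for i in soup.find_all('a'):
    print('name:', i.text)
    # First the link's own following siblings, then those of its parent <p>
    for s in [i.find_next_sibling(), i.parent.find_next_sibling()]:
        while s is not None:
            # Stop at the element that contains the next user's link
            if s.find('a') is not None:
                break
            print('contents:', s.text)
            s = s.find_next_sibling()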

(Note: find_all is the recommended name for findAll and may not be available in older versions of BeautifulSoup. The same goes for find_next_sibling.)
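
As for the loop in the question, it never ends because i is never advanced: i.nextSibling is the same node on every pass, so the condition never changes. Since the real page is said to mix different tag types between links, one possible alternative is to walk link.next_elements and collect every text node until the next <a> appears. The sketch below assumes Python 3, bs4 and the html.parser backend; the helper name text_between_links and the MIXED sample are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with different tag types between the links.
MIXED = """
<p><a href="user123">user1</a><font>Opening for user1</font></p>
<div><span>More from user1</span></div>
<p><a href="user234">user2</a><b>Opening for user2</b></p>
<ul><li>More from user2</li></ul>
"""


def text_between_links(html):
    """Return (username, text) pairs, whatever tags hold the text."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a"):
        parts = []
        # next_elements yields every later tag and string in document order.
        for node in link.next_elements:
            if isinstance(node, Tag):
                if node.name == "a":
                    break  # the next user's section starts here
                continue  # other tags: their text arrives as string nodes
            # Skip the link's own text (strings located inside the link).
            if any(parent is link for parent in node.parents):
                continue
            text = node.strip()
            if text:  # ignore whitespace-only strings between tags
                parts.append(text)
        results.append((link.get_text(strip=True), " ".join(parts)))
    return results


for name, text in text_between_links(MIXED):
    print("Username is:", name)
    print("text is:", text)
```

Because it works on text nodes rather than on a fixed tag name, this variant does not double-count nested tags, though it also picks up any stray text that sits between two links.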

Find everything between two tags in Python
