<p>This is the first paragraph with some details</p>
<p><a href = "user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href = "user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user1</p></font>
!----There is n number of data like this-----!
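This is the structure of my HTML. My goal is to extract each user and their content. In this case it should print everything between two 'a' tags. This is just an example of my structure; in the real HTML there are different kinds of tags between two 'a' tags. I need a solution that iterates over all the tags following an 'a' tag until it finds another 'a' tag. I hope that is clear.
The code I tried is: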
for i in soup.findAll('a'):
    while(i.nextSibling.name!='a'):
        print i.nextSibling
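It gives me an infinite loop. If anyone knows how to solve this, please share it with me.
The expected output is:
User name is: user1
text is: This is opening contents for user1 This is the contents from user1 This is more content from user1
User name is: user2
text is: This is opening contents for user2 This is the contents from user2 This is more content from user2
and so on...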
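Answer 0 (score: 1)
One approach is to search for each <a> tag with find_all() and then, for each link, use find_all_next() to search for the <font> tags that hold that user's content. The following script extracts each user name with its content and saves them as tuples in a list: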
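from bs4 import BeautifulSoup

l = []
# Read the HTML from a file; 'html.parser' is the parser bundled with Python.
with open('htmlfile') as f:
    soup = BeautifulSoup(f, 'html.parser')
for link in soup.find_all('a'):
    s = []
    # Walk every later <font> or <a> tag in document order.
    for elem in link.find_all_next(['font', 'a']):
        if elem.name == 'a':
            # The next link starts the next user's section.
            break
        s.append(elem.get_text())
    user_content = ' '.join(s)
    l.append((link.string, user_content))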
It produces:
[('user1', 'This is the contents from user1 This is more content from user1'),
('user2', 'This is the contents from user2 This is more content from user2')]
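If the tags between two links are not always <font>, one possible variation is to walk every node after the link in document order and collect the text nodes until the next <a> appears. The following is only a minimal sketch of that idea, not part of the script above: `extract_users` and `SAMPLE` are hypothetical names, the sample markup is made up for illustration, and `html.parser` is an assumed parser choice.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: different tag types appear between the two links.
SAMPLE = """
<p>intro</p>
<p><a href="user123">user1</a><font>opening for user1</font></p>
<div><span>more from user1</span></div>
<p><a href="user234">user2</a><b>opening for user2</b></p>
<ul><li>more from user2</li></ul>
"""


def extract_users(html):
    """Return (user name, joined text) pairs, one per <a> tag."""
    soup = BeautifulSoup(html, "html.parser")
    users = []
    for link in soup.find_all("a"):
        parts = []
        # next_elements yields every tag and text node after the link's
        # opening tag, in document order.
        for node in link.next_elements:
            if isinstance(node, Tag):
                if node.name == "a":
                    # The next link starts the next user's section.
                    break
                continue  # other tags: their text nodes are visited later
            # Skip text that belongs to the link itself (the user name).
            if any(parent is link for parent in node.parents):
                continue
            text = node.strip()
            if text:
                parts.append(text)
        users.append((link.get_text(strip=True), " ".join(parts)))
    return users


for name, text in extract_users(SAMPLE):
    print("User name is:", name)
    print("text is:", text)
```

Because the loop looks at text nodes rather than a fixed list of tag names, it does not need to know in advance which tags wrap the content. Note that comments in the markup are also text nodes in Beautiful Soup, so they would need to be filtered out separately if the real pages contain any.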
Answer 1 (score: 0)
Try this:
from bs4 import BeautifulSoup
html="""
<p>This is the first paragraph with some details</p>
<p><a href="user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href="user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user1</p></font>
"""
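soup = BeautifulSoup(html, 'html.parser')
for i in soup.find_all('a'):
    print('name:', i.text)
    # Start from the link's own next sibling, then continue with the
    # siblings of the enclosing <p>.
    for s in [i.find_next_sibling(), i.parent.find_next_sibling()]:
        while s is not None:
            # Stop once a block containing the next user's link is reached.
            if s.find('a') is not None:
                break
            print('contents:', s.text)
            s = s.find_next_sibling()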
(Note: find_all is the recommended name for findAll and may not be available in older versions of Beautiful Soup. The same goes for find_next_sibling.)