Question

I have a long list of blog comments, each of which is marked up like this:

<p> This is the text <br /> of the comment </p>
<div id="comment_details"> Here the details of the same comment </div>

I need to parse each comment and its details in the same iteration of a loop, so that I can store them in order.

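However, I am not sure how to proceed, because so far I can only parse them easily in two separate loops. Is it elegant and practical to do this in a single loop?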

Please consider the following MWE:

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<body>

<div id="firstDiv">
     <br></br>
     <p>My first paragraph.<br />But this a second line</p>
     <div id="secondDiv">
          <b>Date1</b>
     </div> 
     <br></br>  
     <p>My second paragraph.</p>
     <div id="secondDiv">
          <b>Date2</b>
     </div> 
     <br></br>
     <p>My third paragraph.</p>
     <div id="secondDiv">
          <b>Date3</b>
     </div>
     <br></br>
     <p>My fourth paragraph.</p>
     <div id="secondDiv">
          <b>Date4</b>
     </div>
     <br></br>
     <p>My fifth paragraph.</p>
     <div id="secondDiv">
          <b>Date5</b>
     </div>
     <br></br>
 </div>

</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# First loop: the text of each comment
for p in soup.find(id="firstDiv").find_all("p"):
    print(p.get_text())

# Second loop: the details of each comment
for div in soup.find(id="firstDiv").find_all(id="secondDiv"):
    print(div.b.get_text())

Answer 1

If you really wanted the next sibling, that would be easy:

for p in soup.find(id="firstDiv").find_all("p"):
    print(p.get_text())
    print(p.next_sibling.b.get_text())

However, the next thing after the p is the string '\n', not the div you want, so the code above fails.
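To make that concrete, here is a minimal sketch that inspects the node directly after a p. The `html_doc` string is a shortened, hypothetical stand-in for the markup in the question, and the built-in `html.parser` is assumed as the parser:

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened stand-in for the question's markup (an assumption for illustration).
html_doc = """<div id="firstDiv">
<p>My first paragraph.</p>
<div id="secondDiv"><b>Date1</b></div>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")
p = soup.find(id="firstDiv").find("p")

# The node right after </p> is the whitespace between the two tags,
# which BeautifulSoup keeps in the tree as a text node.
node = p.next_sibling
print(type(node).__name__)
print(isinstance(node, NavigableString), node.strip() == "")
```

Because that node is plain text rather than a tag, there is no `b` child to look up on it.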

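The problem is that there is no real structural relationship between the p and the div; it just happens that each p always has a div with a certain id as a later sibling, and you want to take advantage of that. (If you are generating this HTML yourself, fix the structure instead, but I assume you are not.) So, here are some options.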

The best is probably:

for p in soup.find(id="firstDiv").find_all("p"):
    print(p.get_text())
    print(p.find_next_sibling(id='secondDiv').b.get_text())
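Since the question asks about storing the pairs in order, the same idea can be sketched as a loop that builds a list. The `comments` list, the `None` guard, and the shortened `html_doc` are assumptions added for illustration, not part of the original code:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (an assumption for illustration).
html_doc = """
<div id="firstDiv">
  <p>My first paragraph.<br />But this a second line</p>
  <div id="secondDiv"><b>Date1</b></div>
  <p>My second paragraph.</p>
  <div id="secondDiv"><b>Date2</b></div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

comments = []  # hypothetical container: one (text, details) tuple per comment
for p in soup.find(id="firstDiv").find_all("p"):
    details = p.find_next_sibling(id="secondDiv")
    # Guard against a trailing <p> that has no details block after it.
    date = details.b.get_text() if details is not None else None
    comments.append((p.get_text(), date))

for text, date in comments:
    print(date, "->", text)
```

The tuples come out in document order, because `find_all` returns the p elements in the order they appear.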

If you only care about this specific document, and you know that the sibling after the next sibling will always be the div you want:

print(p.get_text())
print(p.next_sibling.next_sibling.b.get_text())

Or you can rely on the fact that find_next_sibling() with no arguments, unlike next_sibling, skips ahead to the first actual DOM element, so:

print(p.get_text())
print(p.find_next_sibling().b.get_text())

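If you don't want to rely on either of those, but you can rely on the fact that they are always one-to-one (that is, there is no chance of a stray p without a corresponding secondDiv element), you can zip the two searches together: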

fdiv = soup.find(id='firstDiv')
for p, sdiv in zip(fdiv.find_all('p'), fdiv.find_all(id='secondDiv')):
    print(p.get_text(), sdiv.b.get_text())
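One caveat with the zip approach is that `zip` silently stops at the shorter of the two lists, so a missing div would shift or drop pairs without any error. A minimal sketch that checks the one-to-one assumption first might look like this; the length check, the `pairs` name, and the shortened `html_doc` are assumptions added for illustration:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (an assumption for illustration).
html_doc = """
<div id="firstDiv">
  <p>My first paragraph.</p>
  <div id="secondDiv"><b>Date1</b></div>
  <p>My second paragraph.</p>
  <div id="secondDiv"><b>Date2</b></div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
fdiv = soup.find(id="firstDiv")
paragraphs = fdiv.find_all("p")
details = fdiv.find_all(id="secondDiv")

# Fail loudly if the two searches do not line up one-to-one.
if len(paragraphs) != len(details):
    raise ValueError("found %d <p> but %d details blocks"
                     % (len(paragraphs), len(details)))

pairs = [(p.get_text(), d.b.get_text()) for p, d in zip(paragraphs, details)]
print(pairs)
```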

You could also iterate over p.next_siblings to find the element you want:

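from bs4 import Tag  # text-node siblings have no attributes, so filter on Tag

for p in soup.find(id='firstDiv').find_all('p'):
    div = next(sib for sib in p.next_siblings
               if isinstance(sib, Tag) and sib.get('id') == 'secondDiv')
    print(p.get_text(), div.b.get_text())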

But ultimately, that is just a more verbose way of writing the first solution, so go back to that one. :)

Parse a paragraph and the subsequent element with BeautifulSoup in a single loop
