Question

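I have a legacy web page that I am scraping with BS4. One of its sections is a long essay that I need to scrape. The essay is oddly formatted, like this: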

<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>

Using bs4, I tried the following. With

soup.find('div', id='essay').text

I can extract

'this is paragraph1' and 'this is paragraph3'

OR

ps = soup.find('div', id='essay').find_all('p')
for p in ps:
    print(p.text)

I can extract

'this is paragraph2' and 'this is paragraph4'

If I use both, I get paragraphs 1, 3, 2, 4, which is the wrong order. I need the paragraph order to be correct as well. What can I do to achieve this?

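Edit: the sample in the question is only an example; it does not guarantee that bare-text and <p> paragraphs strictly alternate. To clarify my question: I want a way to extract the paragraphs IN SEQUENCE, whether or not they are wrapped in <p>.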

Answer 1

BeautifulSoup4 also has a recursive mode, which is enabled by default.

from bs4 import BeautifulSoup
html = """
<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
r = soup.find('div', id="essay", recursive=True).text
print(r)

This works perfectly for me. Try updating BeautifulSoup4 with pip.
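
If you want the paragraphs as a list rather than one block of text, something like the sketch below may help. It relies on `stripped_strings`, which walks the text nodes in document order and trims whitespace. It assumes every text node is its own paragraph, which holds for the sample HTML but not necessarily for the real page.

```python
from bs4 import BeautifulSoup

html = """
<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
essay = soup.find("div", id="essay")

# stripped_strings yields each text node in document order, with surrounding
# whitespace removed, and skips nodes that contain only whitespace.
paragraphs = list(essay.stripped_strings)
print(paragraphs)
# ['this is paragraph1', 'this is paragraph2', 'this is paragraph3', 'this is paragraph4']
```

One caveat: if a <p> contains inline markup such as <b> or <a>, each piece of text inside it comes out as a separate item, so the result may need regrouping.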

Answer 2

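If the two lists are the same length, it may be easier to interleave them than to write Beautiful Soup code that works around the original formatting: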

from itertools import chain

list_a = ['this is paragraph1', 'this is paragraph3']
list_b = ['this is paragraph2', 'this is paragraph4']

print(list(chain.from_iterable(zip(list_a, list_b))))


# ['this is paragraph1', 'this is paragraph2', 'this is paragraph3', 'this is paragraph4']

More information here: Interleaving Lists in Python
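
If the lists can differ in length, a variant with `itertools.zip_longest` might work. This is only a sketch: the `interleave` helper and the sample lists are made up for illustration, and it still assumes the two kinds of paragraph strictly alternate, starting with the bare-text one. The question's edit says that is not guaranteed.

```python
from itertools import chain, zip_longest

# Sentinel used to pad the shorter list; it never collides with real text.
_MISSING = object()

def interleave(list_a, list_b):
    """Alternate items from list_a and list_b, keeping leftovers from the longer one."""
    pairs = zip_longest(list_a, list_b, fillvalue=_MISSING)
    return [item for item in chain.from_iterable(pairs) if item is not _MISSING]

list_a = ['this is paragraph1', 'this is paragraph3', 'this is paragraph5']
list_b = ['this is paragraph2', 'this is paragraph4']

print(interleave(list_a, list_b))
# ['this is paragraph1', 'this is paragraph2', 'this is paragraph3',
#  'this is paragraph4', 'this is paragraph5']
```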

Answer 3

The following seems to work:

import bs4

soup = bs4.BeautifulSoup("""
<div id='essay'>
this is paragraph1
<p>this is paragraph2</p>
this is paragraph3
<p>this is paragraph4</p>
</div>
""", "lxml")

main = soup.find('div', id='essay')
for child in main.children:
    print(child.string)
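
Note that the loop above also prints the whitespace-only strings between tags, and `child.string` is None for a <p> that has more than one child. A slightly more defensive sketch of the same idea is below. The `ordered_paragraphs` helper is a hypothetical name, and the sample HTML adds a <b> tag purely to show the nested case.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

def ordered_paragraphs(container):
    """Return the text of each direct child of container, in document order."""
    paragraphs = []
    for child in container.children:
        if isinstance(child, Comment):
            continue  # HTML comments are not essay text
        if isinstance(child, NavigableString):
            text = child.strip()  # bare text sitting directly inside the div
        elif isinstance(child, Tag):
            # Join any nested inline text so one <p> stays one paragraph
            text = child.get_text(" ", strip=True)
        else:
            continue
        if text:  # drop whitespace-only nodes
            paragraphs.append(text)
    return paragraphs

html = """
<div id='essay'>
  this is paragraph1
  <p>this is <b>paragraph2</b></p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
result = ordered_paragraphs(soup.find("div", id="essay"))
for paragraph in result:
    print(paragraph)
```

Because it only looks at direct children, this treats each run of bare text between two tags as a single paragraph, which may or may not match the real page.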

