I have a legacy web page that I am scraping with BS4. One section is a long essay that I need to scrape. The essay is oddly formatted, for example:
<div id='essay'>
this is paragraph1
<p>this is paragraph2</p>
this is paragraph3
<p>this is paragraph4</p>
</div>
Using bs4, I tried the following. With
soup.find('div', id='essay').text
I can extract
'this is paragraph1' and 'this is paragraph3'
OR
ps = soup.find('div', id='essay').find_all('p')
for p in ps:
    print(p.text)
I can extract
'this is paragraph2' and 'this is paragraph4'
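If I combine the two, I get paragraphs 1, 3, 2, 4, which is the wrong order. I need the paragraph order to be correct as well. What can I do to achieve this?
Edit: the HTML in the question is only an example; it does not guarantee that bare-text paragraphs and <p> paragraphs alternate. To clarify my question: I want a way to extract the paragraphs in sequence, whether or not they are wrapped in <p>.
Answer 0 (score: -1)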
BeautifulSoup4 also has a recursive mode, which is enabled by default.
from bs4 import BeautifulSoup
html = """
<div id='essay'>
this is paragraph1
<p>this is paragraph2</p>
this is paragraph3
<p>this is paragraph4</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
r = soup.find('div', id="essay", recursive=True).text
print(r)
Works perfectly for me. Try updating BeautifulSoup4 with pip.
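To get a list of paragraphs rather than one string, the text can be split afterwards. This is only a minimal sketch: the name `paragraphs` is my own, and splitting on newlines is an assumption that holds for the sample HTML because each paragraph sits on its own line.

```python
from bs4 import BeautifulSoup

html = """
<div id='essay'>
this is paragraph1
<p>this is paragraph2</p>
this is paragraph3
<p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
essay = soup.find("div", id="essay")

# get_text() joins every text node under the div in document order.
# Splitting on newlines is an assumption that fits this sample only.
paragraphs = [line.strip() for line in essay.get_text().splitlines() if line.strip()]
print(paragraphs)
```

If the real page puts several paragraphs on one line, this split will not separate them, and walking the children of the div is probably the safer route.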
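Answer 1 (score: -2)
If the two lists are the same length, it may be easier to interleave them than to write Beautiful Soup code that works around the original formatting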
from itertools import chain
list_a = ['this is paragraph1', 'this is paragraph3']
list_b = ['this is paragraph2', 'this is paragraph4']
print(list(chain.from_iterable(zip(list_a, list_b))))
# ['this is paragraph1', 'this is paragraph2', 'this is paragraph3', 'this is paragraph4']
More information here: Interleaving Lists in Python
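One caveat worth checking before relying on this: zip() stops at the shorter list, so the approach only fits strictly alternating input of equal length. A small sketch, where the `interleave` helper and the extra fifth paragraph are made up for illustration:

```python
from itertools import chain

def interleave(list_a, list_b):
    """Interleave two lists pairwise; zip() stops at the shorter list."""
    return list(chain.from_iterable(zip(list_a, list_b)))

# Hypothetical input: one more bare-text paragraph than <p> paragraphs.
bare = ['this is paragraph1', 'this is paragraph3', 'this is paragraph5']
tagged = ['this is paragraph2', 'this is paragraph4']

result = interleave(bare, tagged)
print(result)  # 'this is paragraph5' is missing from the output
```

Since the question's edit says the alternation is not guaranteed, this will silently drop or misorder paragraphs on such pages.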
Answer 2 (score: -2)
The following seems to work
import bs4
soup = bs4.BeautifulSoup("""
<div id='essay'>
this is paragraph1
<p>this is paragraph2</p>
this is paragraph3
<p>this is paragraph4</p>
</div>
""", "lxml")
main = soup.find('div', id='essay')
for child in main.children:
    # child.string is the text of a bare text node, or of a <p> with a single text child
    print(child.string)
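The loop above also prints the whitespace around each bare text node. A slightly fuller sketch of the same idea, collecting cleaned paragraphs in document order, might look like this. The function name `extract_paragraphs` and its `container_id` parameter are my own, and it assumes each bare text node holds exactly one paragraph:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def extract_paragraphs(html, container_id="essay"):
    """Return the direct children of the div as stripped text, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=container_id)
    paragraphs = []
    for child in container.children:
        if isinstance(child, Tag):
            # A tag such as <p>: take all of its text, including nested tags.
            text = child.get_text(" ", strip=True)
        else:
            # A bare text node directly inside the div.
            text = str(child).strip()
        if text:  # skip whitespace-only nodes between elements
            paragraphs.append(text)
    return paragraphs

# Non-alternating input: two <p> elements in a row.
sample = "<div id='essay'>one<p>two</p><p>three</p>four</div>"
print(extract_paragraphs(sample))
```

Because it walks the children in order, it does not depend on the bare-text and <p> paragraphs alternating.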