Using BeautifulSoup4 on text inside and outside <p> in a <div> section

Date: 2016-07-04 21:31:48

Tags: python web-scraping beautifulsoup

I have a legacy web page that I am scraping with BS4. One section is a long essay that I need to scrape. The essay is formatted oddly, like this:

<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>

With bs4, I tried the following. Using

soup.find('div', id='essay').text

I can extract

'this is paragraph1' and 'this is paragraph3'

Or, using

ps = soup.find('div', id='essay').find_all('p')
for p in ps:
    print(p.text)

I can extract

'this is paragraph2' and 'this is paragraph4'

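If I use both, I get paragraphs 1, 3, 2, 4, which is out of order. I need the paragraphs in the correct sequence as well. What can I do to achieve this?

Edit: the markup in the question is only an example; the alternation between odd and even paragraphs is not guaranteed. Let me clarify my question: I want a way to extract the paragraphs IN SEQUENCE, whether or not they are wrapped in <p>.

3 answers: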

Answer 0 (score: -1)

BeautifulSoup4 also has a recursive mode, which is enabled by default.

from bs4 import BeautifulSoup
html = """
<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
r = soup.find('div', id="essay", recursive=True).text
print (r)

This works perfectly for me. Try updating BeautifulSoup4 with pip.
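Note that .text returns one concatenated string, so the paragraphs still have to be separated afterwards. A minimal sketch of one way to get them as a list in document order, assuming each paragraph is plain text with no nested inline tags, is to use the stripped_strings generator (the variable names essay and paragraphs are just for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
essay = soup.find("div", id="essay")

# stripped_strings yields every text node under the div in document order,
# with surrounding whitespace removed and whitespace-only nodes skipped.
paragraphs = list(essay.stripped_strings)
print(paragraphs)
# ['this is paragraph1', 'this is paragraph2', 'this is paragraph3', 'this is paragraph4']
```

If a paragraph contains inline markup such as <b> or <a>, stripped_strings would split it into several pieces, so this probably only fits pages as simple as the example.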

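Answer 1 (score: -2)

If the two lists are the same length, it may be easier to interleave them than to write Beautiful Soup code that works around the original formatting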

from itertools import chain

list_a = ['this is paragraph1', 'this is paragraph3']
list_b = ['this is paragraph2', 'this is paragraph4']

print(list(chain.from_iterable(zip(list_a, list_b))))


# ['this is paragraph1', 'this is paragraph2', 'this is paragraph3', 'this is paragraph4']

More information here: Interleaving Lists in Python
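Keep in mind that this relies on strict alternation between the two lists. A quick sketch with made-up data (the lists loose and tagged are hypothetical, not taken from the page) shows what can happen when two loose paragraphs come first: zip pairs items by position and stops at the shorter list, so the order is wrong and the last item is dropped.

```python
from itertools import chain

# Hypothetical page order: p1, p2 (loose), p3, p4 (in <p>), p5 (loose)
loose = ["paragraph1", "paragraph2", "paragraph5"]
tagged = ["paragraph3", "paragraph4"]

# zip pairs by position and truncates to the shorter input
interleaved = list(chain.from_iterable(zip(loose, tagged)))
print(interleaved)
# ['paragraph1', 'paragraph3', 'paragraph2', 'paragraph4']
```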

Answer 2 (score: -2)

The following seems to work

import bs4

soup = bs4.BeautifulSoup("""
<div id='essay'>
this is paragraph1
<p>this is paragraph2</p>
this is paragraph3
<p>this is paragraph4</p>
</div>
""", "lxml")

main = soup.find('div', id='essay')
for child in main.children:
    print(child.string)
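Iterating over .children also yields the whitespace-only text nodes between tags, and .string is None for a tag that has more than one child. A possible refinement, offered as a sketch rather than a definitive solution, treats each direct child of the div as one paragraph and skips empty ones. The helper name ordered_paragraphs is made up for this example, and it uses "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag


def ordered_paragraphs(container):
    """Return the text of each direct child of container, in document order."""
    paragraphs = []
    for child in container.children:
        if isinstance(child, Comment):
            # HTML comments are text nodes too; they are not essay content
            continue
        if isinstance(child, Tag):
            # Join any nested inline text (e.g. <b>) into one paragraph
            text = child.get_text(" ", strip=True)
        elif isinstance(child, NavigableString):
            text = child.strip()
        else:
            continue
        if text:  # skip whitespace-only nodes between tags
            paragraphs.append(text)
    return paragraphs


html = """
<div id='essay'>
  this is paragraph1
  <p>this is <b>paragraph2</b></p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
result = ordered_paragraphs(soup.find("div", id="essay"))
print(result)
# ['this is paragraph1', 'this is paragraph2', 'this is paragraph3', 'this is paragraph4']
```

One limitation: loose text that itself contains inline tags directly under the div (outside any <p>) would still be split into separate entries, since each direct child is treated as its own paragraph.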