Extracting the block of text between <p> tags

Time: 2019-04-13 14:50:31

Tags: python html regex parsing beautifulsoup

I have some sample HTML that I am trying to parse and extract data from. Ideally, I would like to extract three parts for each movie: the title, the synopsis, and the cast. The data looks like this:

<div class="content">
<h1 class = “heading1”>MOVIE TITLE<h1>
<h2 class="heading2”>Synopsis</h2>
<div>
<p>this text is the synopsis of the movie.</p>
</div>
<h2 class="heading2”>Cast</h2>
<div>
<p>The cast includes</p>
<ol>
<li>Actor</li>
<li>Actor</li>
<li>Actor</li>
<li>Actor</li>
<li>Actor</li>
</ol>
</div>
</div>

<div class="content">
<h1 class = “heading1”>MOVIE TITLE<h1>
<h2 class="heading2”>Synopsis</h2>
<div>
<p>this text is the synopsis of the movie.</p>
</div>
<h2 class="heading2”>Cast</h2>
<div>
<p>The cast includes</p>
<ol>
<li>Actor</li>
<li>Actor</li>
<li>Actor</li>
<li>Actor</li>
<li>Actor</li>
</ol>
</div>
</div>

So far I parse it with Beautiful Soup like this:

from bs4 import BeautifulSoup

data = open("movies.txt",'r').read()
soup = BeautifulSoup(data, "html.parser")

I then extract each movie instance like so:

movies = soup.find_all('div', attrs={'class':'content'})

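The title of each movie is then easy to get by calling find_all('h1', attrs={'class':'heading1'}) on each movie, since the headings have unique class attributes.

I would also like to extract the synopsis (the line between the <p> tags) and the cast separately, as I did with the title. However, what I have managed so far, as you can imagine, only gives me "Synopsis" and "Cast".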

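2 answers:

Answer 0: (score: 1)

This uses Beautiful Soup 4.7+. You should be able to target the p element easily with CSS selectors.

To get the synopsis, we use the level 4 selector feature :nth-child(an+b of s). It lets us select the first child that matches the selector s, which here is the first h2.heading2 tag. We then use + div to select the div sibling that immediately follows it, and > p to select that div's direct p child.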

from bs4 import BeautifulSoup

html = """
<div class="content">
<h1 class="heading1">MOVIE TITLE</h1>
<h2 class="heading2">Synopsis</h2>
<div>
<p>this text is the synopsis of the movie.</p>
</div>
<h2 class="heading2">Cast</h2>
<div>
<p>The cast includes</p>
<ol>
<li>Actor</li>
<li>Actor</li>
<li>Actor</li>
<li>Actor</li>
<li>Actor</li>
</ol>
</div>
</div>

<div class="content">
<h1 class="heading1">MOVIE TITLE</h1>
<h2 class="heading2">Synopsis</h2>
<div>
<p>this text is the synopsis of the movie.</p>
</div>
<h2 class="heading2">Cast</h2>
<div>
<p>The cast includes</p>
<ol>
<li>Actor</li>
<li>Actor</li>
<li>Actor</li>
<li>Actor</li>
<li>Actor</li>
</ol>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

for movie in soup.select('div.content'):
    print(movie.select_one('h1.heading1').text)
    print(movie.select_one(':nth-child(1 of h2.heading2) + div > p').text)
    for actor in movie.select('ol > li'):
        print(actor.text)

Output:

MOVIE TITLE
this text is the synopsis of the movie.
Actor
Actor
Actor
Actor
Actor
MOVIE TITLE
this text is the synopsis of the movie.
Actor
Actor
Actor
Actor
Actor
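
If the three parts are needed as separate values rather than printed lines, the same selectors can fill one dictionary per movie. The following is only a minimal sketch under that assumption: the function name extract_movies, the dictionary keys, and the sample titles and names are hypothetical and do not come from the question.

```python
from bs4 import BeautifulSoup


def extract_movies(html):
    """Return one dict per div.content with its title, synopsis and cast."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for movie in soup.select("div.content"):
        title = movie.select_one("h1.heading1")
        # First h2.heading2 among its siblings, then the div right after it.
        synopsis = movie.select_one(":nth-child(1 of h2.heading2) + div > p")
        movies.append({
            "title": title.get_text(strip=True) if title else None,
            "synopsis": synopsis.get_text(strip=True) if synopsis else None,
            "cast": [li.get_text(strip=True) for li in movie.select("ol > li")],
        })
    return movies


# Hypothetical sample data with the same structure as the question's HTML.
sample = """
<div class="content">
<h1 class="heading1">Alpha</h1>
<h2 class="heading2">Synopsis</h2>
<div>
<p>First synopsis.</p>
</div>
<h2 class="heading2">Cast</h2>
<div>
<p>The cast includes</p>
<ol>
<li>Ann</li>
<li>Bob</li>
</ol>
</div>
</div>
"""

for entry in extract_movies(sample):
    print(entry)
```

Returning None for a missing title or synopsis, instead of raising, is a design choice of this sketch so that one malformed movie block does not stop the loop.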

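Answer 1: (score: 0)

Your content contains Right Double Quotation Marks; replace them first.
Replace the bad characters, find the Synopsis heading, then extract the div that follows it.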

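from bs4 import BeautifulSoup

# s = your html
# Map curly double quotes (U+201C, U+201D) to the ASCII quote character (34).
trans = str.maketrans({8220: 34, 8221: 34})
soup = BeautifulSoup(s.translate(trans), "html.parser")
contents = soup.find_all('div', attrs={'class': 'content'})
for content in contents:
    # Find the "Synopsis" heading, then take the div that follows it.
    syn = content.find('h2', string='Synopsis')
    print(syn, syn.find_next_sibling('div').text)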
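
The same heading-then-next-div idea can be applied to the Cast section as well. Below is a small self-contained sketch of that, assuming the headings are exactly "Synopsis" and "Cast" as in the question; the helper names section_div and parse, the constant CURLY_TO_ASCII, and the sample text are hypothetical.

```python
from bs4 import BeautifulSoup

# Curly double quotes (U+201C, U+201D) mapped to the ASCII quote character.
CURLY_TO_ASCII = str.maketrans({"\u201c": '"', "\u201d": '"'})


def section_div(content, heading):
    """Return the div following the h2 whose text equals heading, or None."""
    h2 = content.find("h2", string=heading)
    return h2.find_next_sibling("div") if h2 else None


def parse(raw_html):
    """Return a (synopsis, cast) tuple for every div.content in raw_html."""
    soup = BeautifulSoup(raw_html.translate(CURLY_TO_ASCII), "html.parser")
    rows = []
    for content in soup.find_all("div", attrs={"class": "content"}):
        synopsis = section_div(content, "Synopsis")
        cast = section_div(content, "Cast")
        rows.append((
            synopsis.p.get_text(strip=True) if synopsis and synopsis.p else None,
            [li.get_text(strip=True) for li in cast.find_all("li")] if cast else [],
        ))
    return rows


# Hypothetical input that still contains a right curly quote in each class.
raw = (
    '<div class="content">'
    '<h2 class="heading2\u201d>Synopsis</h2><div><p>A short plot.</p></div>'
    '<h2 class="heading2\u201d>Cast</h2><div><p>The cast includes</p>'
    '<ol><li>Ann</li><li>Bob</li></ol></div>'
    '</div>'
)

print(parse(raw))
```

Matching on the heading text rather than on position keeps the lookup working if the two sections ever appear in a different order, at the cost of depending on the exact heading wording.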