Extracting the text between two header tags with BeautifulSoup in Python

Time: 2017-02-25 01:05:14

Tags: python html web-scraping beautifulsoup

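I'm trying to extract a movie plot from a Wikipedia page using BeautifulSoup in Python. I'm new to both Python and BeautifulSoup, so I'm not sure how to approach it.

Here is the input HTML.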

<h2><span class="mw-headline" id="Plot">Plot</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Moana_(2016_film)&amp;action=edit&amp;section=1" title="Edit section: Plot">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<p>A small <a href="/wiki/Pounamu" title="Pounamu">pounamu</a> stone that is the mystical heart of the island <a href="/wiki/Goddess" title="Goddess">goddess</a> Te Fiti is stolen by the <a href="/wiki/Demigod" title="Demigod">demigod</a> <a href="/wiki/M%C4%81ui_(mythology)" title="Māui (mythology)">Maui</a>, who was planning to give it to humanity as a gift. As Maui makes his escape, he is attacked by the lava <a href="/wiki/Demon" title="Demon">demon</a> Te Kā, causing the heart of Te Fiti as well as his power-granting magical fish hook to be lost in the ocean.</p><p>A millennium later, young Moana Waialiki, daughter and heir of the chief on the small <a href="/wiki/Polynesia" title="Polynesia">Polynesian</a> island of Motunui, is chosen by the ocean to receive the heart, but drops it when her father, Chief Tui, comes to get her. He insists the island provides everything the villagers need. But years later, fish become scarce and the island's vegetation begins dying. Moana proposes going beyond the reef to find more fish. Tui rejects her request, as sailing past the reef is forbidden.</p>
<p>Moana's grandmother Tala shows Moana a secret cave behind a waterfall, where she finds boats inside and discovers her ancestors were voyagers, sailing and discovering new islands across the world. Tala explains that they stopped voyaging because Maui stole the heart of Te Fiti, causing Te Kā and monsters to appear in the ocean. Tala then says Te Kā's darkness has been spreading from island to island, slowly killing them. Tala gives Moana the heart of Te Fiti, which she has kept safe for her granddaughter.</p>
<p>Tala falls ill and with her dying breaths tells Moana to set sail. Moana and her pet <a href="/wiki/Rooster" title="Rooster">rooster</a> Heihei depart in a <a href="/wiki/Drua" title="Drua">drua</a> to find Maui. A <a href="/wiki/Manta_ray" title="Manta ray">manta ray</a>, Tala's reincarnation, follows. After a <a href="/wiki/Typhoon" title="Typhoon">typhoon</a> wave flips her sailboat and knocks her unconscious, she awakens the next morning on an island inhabited by Maui, who traps her in a cave and takes her sailboat to search for his fishhook. After escaping and catching up to Maui, Moana tries to convince him to return the heart, but Maui refuses, fearing its power will attract dark creatures.</p>
<p>Sentient coconut pirates called Kakamora surround the boat and steal the heart, but Maui and Moana retrieve it. Maui agrees to help return the heart, but only after he reclaims his hook, which is hidden in Lalotai, the Realm of Monsters. At Lalotai, they retrieve it by tricking Tamatoa, a giant <a href="/wiki/Coconut_crab" title="Coconut crab">coconut crab</a>. Maui teaches Moana how to properly sail and navigate. They arrive at Te Fiti, where Te Kā attacks. Maui is overpowered and Te Kā severely damages his hook and repels their boat far out to sea. Fearful that returning to fight Te Kā will destroy his hook, Maui abandons Moana.</p>
<p>Distraught, Moana begs the ocean to take the heart and choose another person to return it to Te Fiti. The spirit of Tala comes to her and encourages to find her true calling within herself. Inspired, Moana retrieves the heart from the ocean and returns to Te Fiti alone. Maui, having had a change of heart, returns to distract the lava demon, and his hook is destroyed in the battle. Realizing that Te Kā is actually Te Fiti without her heart, Moana asks the ocean to clear a path for Te Kā to approach her. She sings a song, asking Te Kā to remember who she truly is, allowing Moana to restore her heart. Te Fiti returns and gives a new canoe to Moana and a new magical hook to Maui before returning to her island form.</p>
<p>In a <a href="/wiki/Post-credits_scene" title="Post-credits scene">post-credits scene</a>, Tamatoa, who has been stranded on his back during Moana and Maui's escape, grumbles to the audience that they would help him if he was a <a href="/wiki/Sebastian_(Disney)" title="Sebastian (Disney)">Jamaican crab named Sebastian</a>.</p>
<h2><span class="mw-headline" id="Cast">Cast</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Moana_(2016_film)&amp;action=edit&amp;section=2" title="Edit section: Cast">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<div class="thumb tright">

I want to extract only the text between the two h2 tags, i.e. the plot. How can I do that with BeautifulSoup?

Edit 1: Here is my current code.

from BeautifulSoup import *

movie = raw_input('Enter:')
main = "http://www.wikipedia.org"
url = "http://www.wikipedia.org/wiki/"+movie+"_(disambiguation)"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve a list of the anchor tags
# Each tag is like a dictionary of HTML attributes
tags = soup('a')
for tag in tags:
    chk = tag.get('href', None)
    chk = str(chk)
    if "film" in chk:
        final = chk

html = urllib.urlopen(main+final).read()
soup = BeautifulSoup(html)
new = []
spa = soup.findAll("span",id = "Plot")
spa_1 = soup.findAllNext("p")
for i in spa_1:
    print i

I am trying to reach the element with id = Plot and then print all the p tags that follow it.

1 answer:

Answer 0 (score: 2)

The document is structured like this:

[h2] / [span id=Plot]
...
[h2]

What we can do is search for the span with the id "Plot", then walk through the siblings of its parent, collecting their text, until we reach the next H2 heading.

# collect plot in this list
plot = []

# find the node with id of "Plot"
mark = soup.find(id="Plot")

# walk through the siblings of the parent (H2) node 
# until we reach the next H2 node
for elt in mark.parent.nextSiblingGenerator():
    if elt.name == "h2":
        break
    if hasattr(elt, "text"):
        plot.append(elt.text)

# enjoy
print("".join(plot))
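
For anyone on the newer `bs4` package, the same walk might look roughly like the sketch below. It is only an illustration under a few assumptions: `SAMPLE_HTML` is a shortened stand-in for the markup in the question (the final "Not part of the plot." paragraph is made up for the demo), `extract_section` is a hypothetical helper name, and the `next_siblings` iterator is used in place of `nextSiblingGenerator()`. The markup served by Wikipedia today may differ from what the question shows, so the heading lookup may need adjusting.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the Wikipedia markup shown in the question.
# The last paragraph is invented purely to show where extraction stops.
SAMPLE_HTML = """
<h2><span class="mw-headline" id="Plot">Plot</span></h2>
<p>Moana and her pet <a href="/wiki/Rooster" title="Rooster">rooster</a> Heihei depart to find Maui.</p>
<p>Maui teaches Moana how to properly sail and navigate.</p>
<h2><span class="mw-headline" id="Cast">Cast</span></h2>
<p>Not part of the plot.</p>
"""


def extract_section(soup, section_id):
    """Hypothetical helper: return the text of every element between the
    <h2> that contains `section_id` and the next <h2>, one per line."""
    mark = soup.find(id=section_id)
    if mark is None:
        return ""
    heading = mark.find_parent("h2")
    if heading is None:
        return ""
    parts = []
    for elt in heading.next_siblings:
        if not isinstance(elt, Tag):
            continue  # skip bare text nodes such as the newlines between tags
        if elt.name == "h2":
            break  # the next section starts here
        parts.append(elt.get_text())
    return "\n".join(parts)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
plot_text = extract_section(soup, "Plot")
print(plot_text)
```

Skipping everything that is not a `Tag` keeps stray whitespace out of the result, and returning an empty string when the id is missing avoids an `AttributeError` on pages that have no such section.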