BeautifulSoup: how to parse an h2 and the h3 tags beneath it

Date: 2018-02-02 15:14:32

Tags: python python-3.x beautifulsoup lxml

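I'm new to parsing with lxml (I started this afternoon), so please bear with me. I'm trying to use Beautiful Soup with lxml to parse the markup in a CSV of superhero data. I want the text under two h2 tags: Powers and Abilities, and Weapons and Equipment.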

soup.find_all("h2")
 >> [<h2><strong>Origin</strong></h2>,
 <h2><strong>Creation</strong></h2>,
 <h2><strong>Character Evolution</strong></h2>,
 <h2><strong>Major Story Arcs</strong></h2>,
 <h2><strong>Powers and Abilities</strong></h2>,
 <h2><strong>Weapons and Equipment</strong></h2>,
 <h2><strong>Character Profile</strong></h2>,
 <h2><strong>Alternate Realities </strong></h2>,
 <h2><strong>Other Media</strong></h2>,
 <h2>Merchandising</h2>,
 <h2><strong>Depiction and the Iconic Costume</strong></h2>,
 <h2>Popular Recognition</h2>]

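When I look at the next sibling of the Weapons and Equipment h2 (the text is in h3 tags), I get only one item (there should be more). When I change the header.text comparison to the other heading, I get no results at all.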

    for header in soup.find_all('h2'):
        if header.text == "Weapons and Equipment":
            nextNode = header.nextSibling
            # Check for a missing sibling before reading its text
            if nextNode is None:
                break
            print(nextNode.text)
>> Lasso of Truth

The Powers and Abilities and the Weapons entries do show up in the find_all('h3') results (along with other things I don't want):

soup.find_all("h3", text=True)
>> [<h3>Challenge of the Gods</h3>,
 <h3>First clash with Circe</h3>,
 <h3>The New 52</h3>,
 <h3>Meeting Zeus Other Children</h3>,
 <h3>Meeting First Born</h3>,
 <h3><strong>Superhuman Strength</strong></h3>,
 <h3><strong>Superhuman Speed</strong></h3>,
 <h3><strong>Invulnerability/Durability</strong></h3>,
 <h3><strong>Flight</strong></h3>,
 <h3><strong>Healing Factor</strong></h3>,
 <h3><strong>Divine Wisdom</strong></h3>,
 <h3><strong>Super Stamina/Agility</strong></h3>,
 <h3><strong>Great Beauty </strong></h3>,
 <h3><strong>Enhanced Sense</strong></h3>,
 <h3><strong>Other Assorted Divine Powers</strong></h3>,
 <h3><strong>God of War Powers</strong></h3>,
 <h3><strong>Martial Combat</strong></h3>,
 <h3><strong>Lasso of Truth</strong></h3>,
 <h3><strong>Bracelets of Victory</strong></h3>,
 <h3><b>Royal Tiara</b></h3>,
 <h3><strong>The Invisible Plane</strong></h3>,
 <h3><strong>Battle Armour</strong></h3>,
 <h3><strong>Martial Weapons</strong></h3>,
 <h3><strong>Magical Sword</strong></h3>,
 <h3><strong>Sandals of Hermes</strong></h3>,
 <h3><strong>Gauntlet of Atlas</strong></h3>,
 <h3><strong>Earrings</strong></h3>,
 <h3><strong>Power Rings</strong></h3>,
 <h3>War suits (or uniform)</h3>,
 <h3>The Dark Knight Strikes Again</h3>]

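I've been reading the documentation all afternoon but haven't found an example that helps. I really don't know how to get at these items. Any help and explanation would be much appreciated.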

Data excerpt

    h2>Powers And Abilities </h2><br /><br /><ul class="plain-list">
<li>Reality Manipulation - Wanda possesses the ability to manipulate reality         
based on how hard she "wonders". The full extent of this ability is unknown, 
but it is known that she once wondered two of her enemies into non-existence 
during a battle in Las Vegas, Nevada.</li> <li>Psionic Abilities -  The full 
extent of Wanda's psionic abilities is currently unknown, but has been shown 
to include Mental Telepathy, Telekinesis, and Empathy.   <br /></li> 
<li>Superhuman Intelligence - Wanda possesses superhuman intelligence, 
including perfect memory, and data analysis. According to Woo-Z Winks, this 
may cause a problem in battle, as Wanda allows herself to battle purely on 
instinct, while her mind becomes lost in some deep philosophical point.  <br 
/></li> <li>Superhuman 
Strength<br /></li> <li>Superhuman Speed<br /></li> <li>Superhuman Agility<br />
</li> <li>Superhuman Dexterity<br /></li> <li>Superhuman Reflexes<br /></li> 
<li>Superhuman Senses</li></ul><p> <br /> </p> <h2>Paraphenalia </h2>

     <h3><ins>Weapons</ins></h3><p><strong>Lasso of truth</strong></p><p>The 
original lasso, still in existence in</p><p><strong>Harmony and 
Charity</strong></p><p>Wonder woman bracelets, posses intelligence and are 
programed with battle strategies.</p><p><strong>Invisible 
spacecraft</strong></p><p>Wonder Woman carries a invisible 

1 answer:

Answer 0 (score: 0)

I don't think the "sibling" approach is the best fit for this case. You would be better off with a more precise selector.
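For context on why the loop in the question printed a single item: `nextSibling` returns only the one node immediately after the header, not everything that follows it. The sketch below walks `next_siblings` instead and stops at the next h2. The markup is a hypothetical reduction of the headings listed in the question, and it assumes the h3 tags share a parent with the h2 tags, which may not hold for the real data. `h3_titles_under` is an illustrative helper name, and `html.parser` is used only to keep the example dependency-light.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup modelled on the headings in the question (an assumption,
# not the real page): h2 and h3 tags are siblings inside one container.
html = """
<div>
<h2><strong>Powers and Abilities</strong></h2>
<h3><strong>Flight</strong></h3>
<p>Some description.</p>
<h2><strong>Weapons and Equipment</strong></h2>
<h3><strong>Lasso of Truth</strong></h3>
<p>Some description.</p>
<h3><strong>Bracelets of Victory</strong></h3>
<h2><strong>Character Profile</strong></h2>
<h3>Not wanted</h3>
</div>
"""


def h3_titles_under(soup, heading):
    """Return the text of every h3 between the matching h2 and the next h2."""
    titles = []
    for header in soup.find_all("h2"):
        if header.get_text(strip=True) != heading:
            continue
        # next_siblings yields every following sibling, not just the first one.
        for node in header.next_siblings:
            if not isinstance(node, Tag):
                continue  # skip whitespace strings between tags
            if node.name == "h2":
                break  # reached the next section
            if node.name == "h3":
                titles.append(node.get_text(strip=True))
    return titles


soup = BeautifulSoup(html, "html.parser")
weapons = h3_titles_under(soup, "Weapons and Equipment")
print(weapons)
```

Comparing with `get_text(strip=True)` rather than `header.text` also tolerates headings with stray whitespace, such as `Alternate Realities ` in the question's output.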

With the soup.select method, you can match multiple tags using a CSS selector:

soup.select
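As a rough illustration of the idea, the selector `"h2, h3"` matches both heading levels in one call, and the h3 entries can then be grouped under the most recent h2. This is a sketch under assumptions: the markup is hypothetical, `group_h3_by_h2` is an illustrative name, and it relies on `select` returning matches in document order, which I believe holds for Beautiful Soup releases where `select` is backed by soupsieve (4.7 and later) but may not hold for older ones.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the headings in the question (an assumption).
html = """
<h2><strong>Powers and Abilities</strong></h2>
<h3><strong>Superhuman Strength</strong></h3>
<h3><strong>Flight</strong></h3>
<h2><strong>Weapons and Equipment</strong></h2>
<h3><strong>Lasso of Truth</strong></h3>
<h3><b>Royal Tiara</b></h3>
<h2><strong>Character Profile</strong></h2>
"""


def group_h3_by_h2(soup):
    """Map each h2 heading to the list of h3 headings that follow it."""
    sections = {}
    current = None
    # "h2, h3" matches either tag name; this assumes document-order results.
    for tag in soup.select("h2, h3"):
        text = tag.get_text(strip=True)
        if tag.name == "h2":
            current = text
            sections[current] = []
        elif current is not None:
            sections[current].append(text)
    return sections


soup = BeautifulSoup(html, "html.parser")
sections = group_h3_by_h2(soup)
for wanted in ("Powers and Abilities", "Weapons and Equipment"):
    print(wanted, sections.get(wanted, []))
```

Unlike the sibling walk, this grouping does not require the h3 tags to share a parent with the h2 tags; it only depends on the order in which the headings appear in the document.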

CSS selectors are powerful and flexible. I suggest you study them further: https://www.w3schools.com/cssref/css_selectors.asp