Beautiful Soup: ignore the inner HTML

Time: 2012-07-31 13:35:16

Tags: python beautifulsoup

I have the following HTML, and I only want to get the product name and ignore the rest of the HTML. How can I do this?

Using BeautifulSoup, I want this as the output: Apple iPhone 4 Verizon
  <h1 itemprop="itemreviewed">Apple iPhone 4 Verizon    
                        <div class="right">
  <span class="s_button_follow_special" style="display: block">
  <a href="javascript:;" style="display: block" onclick="subscribe(this, 1, 5132);" class="follow_1_5132 s_button_2 s_button_follow" title="Follow Apple iPhone 4 Verizon"><em class="s_icon s_icon_follow"></em>Follow</a>
  <a class="s_button_2 s_button_follow_arrow" href="javascript:;" onclick="subscribe(this, 1, 5132, '', 2);"></a>
  </span>
  <a href="javascript:;" style="display: none" onclick="subscribe(this, 1, 5132);" class="unfollow_1_5132 s_button_2 s_button_follow_disabled s_button_following" title="Unfollow Apple iPhone 4 Verizon"><span><em class="s_icon s_icon_following"></em>Following</span></a>
  </div>
  </h1>


  header= soup('h1', {'itemprop' : 'itemreviewed'})

2 answers:

Answer 0 (score: 0)

Something like:

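from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # html: a string holding the <h1 ...> markup from the question
# soup.h1['itemprop'] is only the attribute value, so read the tag's first child instead
header = soup.h1.contents[0].strip()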
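
To make the idea above concrete, here is a minimal runnable sketch. It assumes the bs4 package (Beautiful Soup 4) and the built-in "html.parser"; the markup is a shortened, hypothetical stand-in for the question's HTML that keeps the same structure (text first, then a nested div).

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (assumption: same structure).
html = """<h1 itemprop="itemreviewed">Apple iPhone 4 Verizon
    <div class="right"><a href="javascript:;">Follow</a></div>
</h1>"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1", itemprop="itemreviewed")

# h1.contents lists the direct children; the first one is the leading text node.
product_name = h1.contents[0].strip()
print(product_name)
```

This relies on the product name being the very first child of the h1, which is the case in the question's markup but may not hold on other pages.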

Answer 1 (score: 0)

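The "Apple iPhone 4 Verizon" text is its own element in the parse tree, separate from the other elements; you can select it by fetching a nearby element and navigating with nextSibling, previousSibling, next, and previous.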

So this should work:

header = soup.find('h1', itemprop='itemreviewed')
text = header.next
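
The navigation approach can be sketched as a self-contained script. This is an illustration under assumptions, not the answerer's exact code: it uses bs4 (Beautiful Soup 4), where next_element is the current spelling of next, and a shortened, hypothetical version of the question's markup. The second variable shows an alternative that gathers only the text nodes sitting directly under the h1, which may be more robust if a child tag ever comes before the text.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (assumption: same structure).
html = """<h1 itemprop="itemreviewed">Apple iPhone 4 Verizon
    <div class="right"><a href="javascript:;">Follow</a></div>
</h1>"""

soup = BeautifulSoup(html, "html.parser")
header = soup.find("h1", itemprop="itemreviewed")

# next_element is the node parsed right after the opening <h1> tag:
# here, the leading text node. strip() drops the trailing whitespace.
text = header.next_element.strip()

# Alternative: join only the text nodes that are direct children of <h1>,
# skipping any text inside the nested <div>.
direct_text = "".join(header.find_all(string=True, recursive=False)).strip()

print(text)
print(direct_text)
```

Both variables hold the product name for this markup; the text "Follow" inside the nested div is ignored because it is not a direct child of the h1.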