Scraping JSON data with Beautiful Soup

Time: 2016-02-17 20:16:34

Tags: python web-scraping beautifulsoup

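I am trying to find JSON data stored inside a div with a given class, and I want to get the value of "url". I am using video_link = self.soup.find('div', {'class': 'video-embed-big'}), but I can't get at the data in that div that holds the quoted URL.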

<div class="video-embed-big video-embed-area bf_dom" id="video_buzz_element_4154403_7994283" rel:thumb="https://img.youtube.com/vi/_Ym0LW_uPPk/2.jpg" rel:bf_bucket_data="{"video": {"size": "big", "width":"625", "height":"376", "url":"https://youtube.com/watch?v=_Ym0LW_uPPk", "id":"4154403_7994283"}}">
  <div style="position:relative;" id="video_wrapper_4154403_7994283">     
     <iframe id="yt_4154403_7994283" class="ytvideo" type="text/html" allowscriptaccess="always" allowfullscreen="true" width="625" height="376" src="https://www.youtube.com/embed/_Ym0LW_uPPk?version=3&amp;hl=en&amp;fs=1&amp;enablejsapi=1&amp;origin=http://www.buzzfeed.com&amp;autoplay=0&amp;showinfo=0&amp;wmode=opaque" frameborder="0">
          </iframe>
     </div>
</div>

2 Answers:

Answer 0 (score: 1)

How about this:

video_div = self.soup.find('div', id=lambda d: d and d.startswith('video_wrapper_'))
video_link = video_div.find('iframe')['src']

which returns

In [5]: video_link
Out[5]: 'https://www.youtube.com/embed/_Ym0LW_uPPk?version=3&hl=en&fs=1&enablejsapi=1&origin=http://www.buzzfeed.com&autoplay=0&showinfo=0&wmode=opaque'
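For anyone who wants to try the id filter outside the asker's class, here is a minimal standalone sketch. It assumes a trimmed copy of the question's markup, a plain `soup` object instead of `self.soup`, and the built-in 'html.parser' backend; `is_video_wrapper` is just an illustrative name for the lambda above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: the real page
# has more attributes, which do not matter for this lookup).
html = '''
<div class="video-embed-big" id="video_buzz_element_4154403_7994283">
  <div style="position:relative;" id="video_wrapper_4154403_7994283">
    <iframe id="yt_4154403_7994283" class="ytvideo"
            src="https://www.youtube.com/embed/_Ym0LW_uPPk?version=3&amp;hl=en"></iframe>
  </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')


def is_video_wrapper(id_value):
    # Beautiful Soup calls this with each tag's id value, or None when the
    # tag has no id, hence the explicit None check.
    return id_value is not None and id_value.startswith('video_wrapper_')


video_div = soup.find('div', id=is_video_wrapper)
video_link = video_div.find('iframe')['src']
print(video_link)
```

The None check does the same job as the `d and ...` guard in the lambda: tags without an id would otherwise raise an AttributeError on `startswith`.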

If you want the actual YouTube page, you can go one step further with urlparse.

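import urlparse  # Python 2 module; in Python 3 it lives at urllib.parse

video_div = self.soup.find('div', id=lambda d: d and d.startswith('video_wrapper_'))
video_link = video_div.find('iframe')['src']
url = urlparse.urlparse(video_link)
# url[0] is the scheme, url[1] the host, url[2] the path '/embed/<video id>';
# splitting the path on '/' leaves the video id at index 2.
youtube_url = urlparse.urlunparse((url[0], url[1], "watch?v=" + url[2].split('/')[2],'','',''))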

This is the output of youtube_url:

In [15]: youtube_url
Out[15]: 'https://www.youtube.com/watch?v=_Ym0LW_uPPk'
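On Python 3 the same idea can be sketched with `urllib.parse`, using only the standard library. The helper name `embed_to_watch_url` is made up for this example, and it assumes the embed URL always has the form /embed/<video id>. The sample URL is a shortened version of the iframe src from the question.

```python
from urllib.parse import urlparse, urlunparse


def embed_to_watch_url(embed_url):
    """Convert a YouTube /embed/<id> URL into a /watch?v=<id> URL."""
    parts = urlparse(embed_url)
    # parts.path looks like '/embed/_Ym0LW_uPPk'; the last segment is the id.
    video_id = parts.path.rstrip('/').split('/')[-1]
    # Put the id in the query slot rather than gluing it onto the path.
    return urlunparse((parts.scheme, parts.netloc, '/watch', '', 'v=' + video_id, ''))


video_link = 'https://www.youtube.com/embed/_Ym0LW_uPPk?version=3&hl=en&fs=1'
youtube_url = embed_to_watch_url(video_link)
print(youtube_url)
```

Passing the id through the query component keeps the original embed parameters (version, hl, fs and so on) out of the result.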

Answer 1 (score: 0)

video_link = self.soup.find('div',{'class':'video-embed-big'}).div.iframe['src']

You need to use the "." operator to step down into the div's children, then read the src attribute to get the URL.
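A standalone sketch of the dot navigation follows. Since the question asks for the "url" field of the JSON in rel:bf_bucket_data, it also shows one way to read that attribute with json.loads. Assumptions: the markup is a trimmed copy of the question's, and the real page wraps the JSON attribute value in single quotes (or escapes the inner double quotes) so that it parses as a single attribute value.

```python
import json

from bs4 import BeautifulSoup

# Trimmed copy of the question's markup. Assumption: the JSON attribute
# value is single-quoted in the real page source.
html = '''
<div class="video-embed-big video-embed-area bf_dom"
     id="video_buzz_element_4154403_7994283"
     rel:bf_bucket_data='{"video": {"size": "big", "width": "625", "height": "376", "url": "https://youtube.com/watch?v=_Ym0LW_uPPk", "id": "4154403_7994283"}}'>
  <div style="position:relative;" id="video_wrapper_4154403_7994283">
    <iframe id="yt_4154403_7994283" class="ytvideo"
            src="https://www.youtube.com/embed/_Ym0LW_uPPk?version=3"></iframe>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
outer_div = soup.find('div', {'class': 'video-embed-big'})

# Dot navigation: .div is the first nested div, .iframe the first iframe in it.
embed_src = outer_div.div.iframe['src']

# The attribute value is a plain string; json.loads turns it into a dict.
bucket_data = json.loads(outer_div['rel:bf_bucket_data'])
video_url = bucket_data['video']['url']

print(embed_src)
print(video_url)
```

The dot chain raises an error if any step is missing (for example, no iframe inside the div), so checking each step for None is worth it on pages where the markup varies.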