Python - Extracting <iframe> sources

Time: 2017-03-08 04:18:48

Tags: python iframe

I have a nested list of <iframe>s:

iframes = [
 [<iframe data-lazy-src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/309819830&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false" frameborder="no" height="166" scrolling="no" src="data:image/gif;base64,R0lGODdhAQABAPAAAP///wAAACwAAAAAAQABAEACAkQBADs=" width="100%"></iframe>,
  <iframe allowtransparency="true" data-lazy-src="//www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&amp;width=300&amp;height=62&amp;show_faces=false&amp;colorscheme=light&amp;stream=false&amp;show_border=false&amp;header=false" frameborder="0" scrolling="no" src="data:image/gif;base64,R0lGODdhAQABAPAAAP///wAAACwAAAAAAQABAEACAkQBADs=" style="border:none; overflow:hidden; width:300px; height:62px;"></iframe>,
  <iframe allowfullscreen="" data-lazy-src="//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1" frameborder="0" height="169" src="data:image/gif;base64,R0lGODdhAQABAPAAAP///wAAACwAAAAAAQABAEACAkQBADs=" width="100%"></iframe>],
 [<iframe data-lazy-src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/310079005&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false" frameborder="no" height="166" scrolling="no" src="data:image/gif;base64,R0lGODdhAQABAPAAAP///wAAACwAAAAAAQABAEACAkQBADs=" width="100%"></iframe>,
  <iframe allowtransparency="true" data-lazy-src="//www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&amp;width=300&amp;height=62&amp;show_faces=false&amp;colorscheme=light&amp;stream=false&amp;show_border=false&amp;header=false" frameborder="0" scrolling="no" src="data:image/gif;base64,R0lGODdhAQABAPAAAP///wAAACwAAAAAAQABAEACAkQBADs=" style="border:none; overflow:hidden; width:300px; height:62px;"></iframe>,
  <iframe allowfullscreen="" data-lazy-src="//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1" frameborder="0" height="169" src="data:image/gif;base64,R0lGODdhAQABAPAAAP///wAAACwAAAAAAQABAEACAkQBADs=" width="100%"></iframe>],
 [<iframe etc],
 [<iframe etc]]

I want to get all the ['data-lazy-src'] values from it.

I am using this code:

for iframe in iframes:
    for i in iframe:
        scheme, netloc, path, params, query, fragment = urlparse(i.attrs['data-lazy-src'])
        if not scheme:
            scheme = 'http'
        url = urlunparse((scheme, netloc, path, params, query, fragment))
        print('Fetching {}'.format(url))
        f = urllib2.urlopen(url)

But I get:

Fetching http://www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1
Fetching http://www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1
Fetching http://www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1

I know I'm missing something very obvious, but I can't see it.

Can anyone help me?

2 answers:

Answer 0: (score: 1)

You can get the HTML string from the iframes and then pass it to BeautifulSoup for easy parsing. Try something like this:

from bs4 import BeautifulSoup

iframe = '<iframe data-lazy-src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/309819830..." frameborder="no"></iframe>'

soup = BeautifulSoup(iframe, 'html.parser')
tag = soup.find_all('iframe')[0]
print(tag['data-lazy-src'])
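
The snippet above reads the attribute from the first iframe only. As a rough sketch of the same idea applied to every iframe in a string, the following uses the standard library's html.parser instead of BeautifulSoup, so it runs without third-party packages. This is a swapped-in technique, not what the answer itself proposes. The names LazySrcCollector, extract_lazy_sources and sample_html are hypothetical, and the sample markup is shortened from the question.

```python
from html.parser import HTMLParser


class LazySrcCollector(HTMLParser):
    """Collect the data-lazy-src attribute of every <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        # attrs is a list of (name, value) pairs with lowercased names.
        value = dict(attrs).get("data-lazy-src")
        if value is not None:
            self.sources.append(value)


def extract_lazy_sources(html):
    """Return the data-lazy-src values found in html, in document order."""
    parser = LazySrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Hypothetical sample input, shortened from the iframes in the question.
sample_html = (
    '<iframe data-lazy-src="//www.youtube.com/embed/videoseries'
    '?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1"></iframe>'
    '<iframe src="about:blank"></iframe>'
    '<iframe data-lazy-src="//www.facebook.com/plugins/likebox.php'
    '?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&amp;width=300"></iframe>'
)

lazy_sources = extract_lazy_sources(sample_html)
print(lazy_sources)
```

Iframes without a data-lazy-src attribute are skipped rather than raising a KeyError, which is the main behavioural difference from indexing the tag directly.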

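Answer 1: (score: 0)

The problem was in how the nested list was generated: the result of soup.find_all('iframe') was being appended to iframes = [].

After removing the append step, it works like this: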

(...)

iframes = soup.find_all('iframe')

for iframe in iframes:
    scheme, netloc, path, params, query, fragment = urlparse(iframe.attrs['data-lazy-src'])
    if not scheme:
        scheme = 'http' # default scheme you used when getting the current page
    url = urlunparse((scheme, netloc, path, params, query, fragment))
    print('Fetching {}'.format(url))
    f = urllib2.urlopen(url)

Result:

Fetching https://www.youtube.com/embed/OWr5FawT2Ks?rel=0
Fetching https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/308112514&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false
Fetching http://www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&width=300&height=62&show_faces=false&colorscheme=light&stream=false&show_border=false&header=false
Fetching http://www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1
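
If the nested list cannot be avoided, one possible approach is to flatten it before the loop. The sketch below assumes Python 3 (the thread uses Python 2's urllib2 and urlparse modules) and works on plain URL strings rather than BeautifulSoup tags, so it makes no network requests. normalize_url and nested_sources are hypothetical names, and the URLs are shortened from the question.

```python
from itertools import chain
from urllib.parse import urlparse, urlunparse


def normalize_url(raw_url, default_scheme="http"):
    """Return raw_url with default_scheme filled in when it has no scheme."""
    scheme, netloc, path, params, query, fragment = urlparse(raw_url)
    if not scheme:
        # Scheme-relative URLs such as //www.youtube.com/... land here.
        scheme = default_scheme
    return urlunparse((scheme, netloc, path, params, query, fragment))


# Hypothetical stand-in for the data-lazy-src values of a nested iframe list.
nested_sources = [
    [
        "https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/309819830",
        "//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1",
    ],
    [
        "//www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale",
    ],
]

# chain.from_iterable walks the inner lists in order, yielding one flat sequence.
urls = [normalize_url(u) for u in chain.from_iterable(nested_sources)]
for url in urls:
    print('Fetching {}'.format(url))
```

With real tags, the same flattening applies to the list of find_all results. Building the list with extend instead of append avoids the nesting in the first place.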