Extracting information from a web page using urllib2

Time: 2013-05-11 21:21:32

Tags: python html parsing urllib2

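I'm writing part of a program that searches YouTube for a phrase, and I then want it to get the URL of the first video. But I can't figure out how to get the first video's URL.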

Here is my code:

import urllib2, urllib

raw_i=raw_input("Search: ")
x = urllib.quote_plus(raw_i)
site1 = urllib2.urlopen('http://www.youtube.com/results?search_query=%s'%x)
y = site1.read()

This reads the whole search results page, but I want it to return only the video's URL.

For example, let's use the phrase "Harry Nilsson coconut".
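For that phrase, the request URL built by the code above can be checked in isolation. This is only a sketch: the `try`/`except` import is an assumption added so the snippet runs on both Python 2 and Python 3, and the names `phrase` and `search_url` are illustrative.

```python
try:
    from urllib import quote_plus          # Python 2
except ImportError:
    from urllib.parse import quote_plus    # Python 3

phrase = "Harry Nilsson coconut"

# quote_plus replaces spaces with '+' and percent-encodes unsafe characters.
search_url = 'http://www.youtube.com/results?search_query=%s' % quote_plus(phrase)
print(search_url)
```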

Here is the HTML for the first video:

 <li class="yt-lockup2 clearfix yt-uix-tile result-item-padding has-hover-effects yt-lockup2-video yt-lockup2-tile context-data-item" data-context-item-title="Harry Nilsson - Coconut (1971)" data-context-item-views="2,930,881 views" data-context-item-type="video" data-context-item-id="Tbgv8PkO9eo" data-context-item-time="4:32" data-context-item-user="Zoltán Makk">
    <div class="yt-lockup2-thumbnail">
         <a href="/watch?v=Tbgv8PkO9eo" class="ux-thumb-wrap yt-uix-sessionlink yt-uix-contextlink contains-addto " data-sessionlink="ved=CDIQwBs&amp;ei=prWOUZT9KIK8igLtyICAAQ">
         <span class="video-thumb  yt-thumb yt-thumb-185" >
      <span class="yt-thumb-default">
        <span class="yt-thumb-clip">
          <span class="yt-thumb-clip-inner">
            <img alt="Thumbnail" src="//i1.ytimg.com/vi/Tbgv8PkO9eo/mqdefault.jpg" width="185" >
            <span class="vertical-align"></span>
      </span>
    </span>
  </span>
</span>
<span class="video-time">4:32</span>

I want it to return only "/watch?v=Tbgv8PkO9eo".

Thanks!

1 answer:

Answer 0 (score: 1):

You can use HTMLParser. Create your own parser class derived from Python's HTMLParser class.

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
           # Check the list of defined attributes.
           for name, value in attrs:
               # If href is defined, print it.
               if name == "href":
                   print name, "=", value

Then create the parser and feed it your HTML string:

# Adjacent string literals inside parentheses are concatenated, so no
# backslash continuations are needed.
# Under Python 2, a source file containing the non-ASCII character below
# needs a "# -*- coding: utf-8 -*-" line at the top.
your_html_string = (
    '<li class="yt-lockup2 clearfix yt-uix-tile result-item-padding '
    'has-hover-effects yt-lockup2-video yt-lockup2-tile context-data-item" '
    'data-context-item-title="Harry Nilsson - Coconut (1971)" '
    'data-context-item-views="2,930,881 views" '
    'data-context-item-type="video" data-context-item-id="Tbgv8PkO9eo" '
    'data-context-item-time="4:32" data-context-item-user="Zoltán Makk">'
    '<div class="yt-lockup2-thumbnail">'
    '<a href="/watch?v=Tbgv8PkO9eo" class="ux-thumb-wrap '
    'yt-uix-sessionlink yt-uix-contextlink contains-addto" '
    'data-sessionlink="ved=CDIQwBs&amp;ei=prWOUZT9KIK8igLtyICAAQ">'
    '<span class="video-thumb  yt-thumb yt-thumb-185" >'
    '<span class="yt-thumb-default">'
    '<span class="yt-thumb-clip">'
    '<span class="yt-thumb-clip-inner">'
    '<img alt="Thumbnail" '
    'src="//i1.ytimg.com/vi/Tbgv8PkO9eo/mqdefault.jpg" '
    'width="185" > <span class="vertical-align"></span>'
    '</span> </span></span></span>'
    '<span class="video-time">4:32</span>'
)

parser = MyHTMLParser()
parser.feed(your_html_string)

The result is:

>>> 
href = /watch?v=Tbgv8PkO9eo
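The parser above prints every href it meets, whereas the question asks for only the first video link. One possible variation is sketched below. It assumes that the first anchor whose href starts with "/watch?v=" belongs to the first search result, which held for the markup quoted in the question but may not hold for other pages. The names `FirstWatchLinkParser`, `first_video_url` and `sample` are hypothetical, and the `try`/`except` import is there only so the sketch runs on both Python 2 and Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
    from urlparse import urljoin
except ImportError:
    from html.parser import HTMLParser     # Python 3
    from urllib.parse import urljoin


class FirstWatchLinkParser(HTMLParser):
    """Remembers the first <a> whose href starts with /watch?v= (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.first_link = None

    def handle_starttag(self, tag, attrs):
        # Skip non-anchor tags, and skip everything once a link is stored.
        if tag != "a" or self.first_link is not None:
            return
        # attrs is a list of (name, value) pairs; a dict makes lookup easier.
        href = dict(attrs).get("href")
        if href and href.startswith("/watch?v="):
            self.first_link = href


def first_video_url(html, base="http://www.youtube.com"):
    """Return the absolute URL of the first watch link, or None if there is none."""
    parser = FirstWatchLinkParser()
    parser.feed(html)
    if parser.first_link is None:
        return None
    return urljoin(base, parser.first_link)


# Small made-up fragment: one unrelated link followed by two watch links.
sample = ('<div><a href="/about">About</a>'
          '<a href="/watch?v=Tbgv8PkO9eo" class="ux-thumb-wrap">first</a>'
          '<a href="/watch?v=other">second</a></div>')
print(first_video_url(sample))
```

With the question's code, the page body read into `y` could be passed as `first_video_url(y)`. Returning None when nothing matches lets the caller decide how to handle a page with no results.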