Extracting YouTube meta info using regular expressions

Time: 2017-08-18 05:59:49

Tags: python regex youtube

I want to extract YouTube video meta info using Python regular expressions. Here is what I have so far; it extracts the video URLs and video titles:

import re
import urllib2

#https://www.youtube.com/user/<channel_name>/videos
#Here the channel name is schooloflifechannel
site = "https://www.youtube.com/user/schooloflifechannel/videos"
videos = [] 
#m_filter matches <ul>...</ul> blocks (defined here but not used below)
m_filter = r'<ul.*?/ul>'
read_url = urllib2.urlopen(site)
#re_filter defines a filter to find all the anchor tags using regular expressions
re_filter = r'<a.*?/a>'
all_links_temp = re.findall(re_filter,read_url.read(),re.MULTILINE)
#This variable denotes the class of each video link on the channel page
video_link_class = 'class="yt-uix-sessionlink yt-uix-tile-link'
video_titles = []
video_links = []
#This loop finds all the titles and links for videos and stores them in two separate lists.
#You could also use a dictionary here which might make things easier for parsing further
for temp in all_links_temp:
    if video_link_class in temp:
        title_filter = r'title=".*?"'
        video_title_temp = re.findall(title_filter,temp,re.MULTILINE)
        video_titles.append(video_title_temp)
        #The link is collected inside the same check so that titles and links stay aligned by index
        url_filter = r'href=".*?"'
        video_link_temp = re.findall(url_filter,temp,re.MULTILINE)
        video_links.append(video_link_temp)
#This pairs each title with its link by position, builds (link, title) tuples, and prints the list
for title_match, link_match in zip(video_titles, video_links):
    #str() of a one-element list looks like ['title="..."'], so the slices strip that wrapper
    temp_title = str(title_match)[9:-3]
    temp_link = "https://www.youtube.com"+str(link_match)[8:-3]
    videos.append((temp_link, temp_title))
print videos
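
The slicing of `str(...)` above depends on the exact text of a list's repr. A minimal sketch of an alternative, assuming Python 3 (unlike the Python 2 code above), is to use capture groups so that the regex returns only the attribute values. The sample anchor tag and the helper name `parse_anchor` are hypothetical and only serve as illustration; the real page markup may differ.

```python
import re

# Hypothetical sample anchor; the real page markup may differ.
SAMPLE_ANCHOR = (
    '<a class="yt-uix-sessionlink yt-uix-tile-link" '
    'title="Example Video" href="/watch?v=abc123">Example Video</a>'
)

# The parentheses capture only the attribute value, without the quotes.
TITLE_RE = re.compile(r'title="(.*?)"')
HREF_RE = re.compile(r'href="(.*?)"')


def parse_anchor(anchor):
    """Return (url, title) for one anchor tag, or None if an attribute is missing."""
    title = TITLE_RE.search(anchor)
    href = HREF_RE.search(anchor)
    if title is None or href is None:
        return None
    return ("https://www.youtube.com" + href.group(1), title.group(1))


print(parse_anchor(SAMPLE_ANCHOR))
# → ('https://www.youtube.com/watch?v=abc123', 'Example Video')
```

Because both values come from the same anchor tag, the title and the link cannot drift out of step the way two separately filled lists can.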

Now I also want to extract the meta info from class="yt-lockup-meta-info" and join all the results together in this format: ("VIDEO_URL", "VIDEO_TITLE", "VIDEO_UPLOAD_DATE"). How can I do this with Python re?
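
One possible way to build (VIDEO_URL, VIDEO_TITLE, VIDEO_UPLOAD_DATE) tuples with `re` is sketched below for Python 3. It is not a confirmed solution. It assumes that each video link is followed by a `<ul class="yt-lockup-meta-info">` list and that the last `<li>` in that list holds the upload date. The sample HTML, the `<li>` layout, and the function name `extract_videos` are all assumptions made for illustration.

```python
import re

# Hypothetical markup built from the class names in the question;
# the real YouTube page may be structured differently.
SAMPLE_HTML = """
<div class="yt-lockup-content">
  <a class="yt-uix-sessionlink yt-uix-tile-link" title="First Video" href="/watch?v=aaa">First Video</a>
  <ul class="yt-lockup-meta-info"><li>1,000 views</li><li>2 days ago</li></ul>
</div>
<div class="yt-lockup-content">
  <a class="yt-uix-sessionlink yt-uix-tile-link" title="Second Video" href="/watch?v=bbb">Second Video</a>
  <ul class="yt-lockup-meta-info"><li>5,000 views</li><li>1 week ago</li></ul>
</div>
"""

# Group 1: opening tag of a video link. Group 2: contents of the next meta-info list.
ENTRY_RE = re.compile(
    r'(<a\b[^>]*yt-uix-tile-link[^>]*>)'
    r'.*?'
    r'<ul class="yt-lockup-meta-info">(.*?)</ul>',
    re.DOTALL,  # let "." cross line breaks between the link and its list
)
TITLE_RE = re.compile(r'title="(.*?)"')
HREF_RE = re.compile(r'href="(.*?)"')
ITEM_RE = re.compile(r'<li>(.*?)</li>', re.DOTALL)


def extract_videos(html):
    """Return a list of (url, title, upload_date) tuples found in html."""
    videos = []
    for tag, meta in ENTRY_RE.findall(html):
        title = TITLE_RE.search(tag)
        href = HREF_RE.search(tag)
        items = ITEM_RE.findall(meta)
        if title is None or href is None or not items:
            continue
        # Assumption: the last <li> carries the upload date text.
        url = "https://www.youtube.com" + href.group(1)
        videos.append((url, title.group(1), items[-1].strip()))
    return videos


for video in extract_videos(SAMPLE_HTML):
    print(video)
```

If a video link has no meta-info list of its own, the lazy `.*?` would run on to the next video's list and pair the wrong date with it. A pattern like this therefore only holds while the markup matches the assumed layout.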

0 answers:

No answers yet