Question

I want to extract YouTube video metadata using Python regular expressions. Here is the code I found for extracting the video URLs and video titles:

import re
import urllib2

#https://www.youtube.com/user/<channel_name>/videos
#Here the channel name is schooloflifechannel
site = "https://www.youtube.com/user/schooloflifechannel/videos"
videos = [] 
#m_filter would match whole <ul> blocks; it is not used below
m_filter = r'<ul.*?/ul>'
read_url = urllib2.urlopen(site)
#re_filter defines a filter to find all the anchor tags using regular expressions
re_filter = r'<a.*?/a>'
all_links_temp = re.findall(re_filter,read_url.read(),re.MULTILINE)
#This variable denotes the class of each video link of the channel
video_link_class = 'class="yt-uix-sessionlink yt-uix-tile-link'
video_titles = []
video_links = []
#This loop finds all the titles and links for videos and stores them in two separate lists.
#You could also use a dictionary here which might make things easier for parsing further
for temp in all_links_temp:
    if video_link_class in temp:
        title_filter = r'title=".*?"'
        video_title_temp = re.findall(title_filter,temp,re.MULTILINE)
        video_titles.append(video_title_temp)
        #The link is collected inside the same if block so that titles and links stay aligned
        url_filter = r'href=".*?"'
        video_link_temp = re.findall(url_filter,temp,re.MULTILINE)
        video_links.append(video_link_temp)
#This builds a list of (link, title) tuples for the videos and prints it
j = 0
for i in video_titles:
    temp_title = str(i)
    temp_title = temp_title[9:-3]
    temp_link = str(video_links[j])
    temp_link = "https://www.youtube.com"+temp_link[8:-3]
    videos.append((temp_link, temp_title))
    j = j + 1
print videos

Now I also want to extract the metadata from class="yt-lockup-meta-info" and join all the results together in this format: ("VIDEO_URL", "VIDEO_TITLE", "VIDEO_UPLOAD_DATE"). How can I do this with Python re?
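To make the desired output concrete, here is a minimal sketch of how the three fields could be joined with re. It runs on a hard-coded sample string rather than the live page. The sample markup, the helper name extract_videos, and the guess that the upload date is the last <li> inside the yt-lockup-meta-info list are all assumptions for illustration, not something confirmed from the actual YouTube page.

```python
import re

# Hypothetical sample markup (an assumption; not captured from the real page).
SAMPLE_HTML = '''
<li class="channels-content-item">
  <h3 class="yt-lockup-title">
    <a class="yt-uix-sessionlink yt-uix-tile-link" title="First video" href="/watch?v=AAAAAAAAAAA">First video</a>
  </h3>
  <div class="yt-lockup-meta">
    <ul class="yt-lockup-meta-info"><li>1,234 views</li><li>2 days ago</li></ul>
  </div>
</li>
<li class="channels-content-item">
  <h3 class="yt-lockup-title">
    <a class="yt-uix-sessionlink yt-uix-tile-link" title="Second video" href="/watch?v=BBBBBBBBBBB">Second video</a>
  </h3>
  <div class="yt-lockup-meta">
    <ul class="yt-lockup-meta-info"><li>987 views</li><li>1 week ago</li></ul>
  </div>
</li>
'''

# Opening tag of each video link, identified by the class used in the question.
ANCHOR_RE = re.compile(r'<a\b[^>]*class="yt-uix-sessionlink yt-uix-tile-link[^>]*>')
TITLE_RE = re.compile(r'title="(.*?)"')
HREF_RE = re.compile(r'href="(.*?)"')
# Contents of the meta-info list; DOTALL lets the match span several lines.
META_RE = re.compile(r'<ul class="yt-lockup-meta-info">(.*?)</ul>', re.DOTALL)
LI_RE = re.compile(r'<li>(.*?)</li>', re.DOTALL)


def extract_videos(html):
    """Return a list of (video_url, video_title, video_upload_date) tuples."""
    results = []
    for anchor in ANCHOR_RE.finditer(html):
        tag = anchor.group(0)
        title = TITLE_RE.search(tag)
        href = HREF_RE.search(tag)
        # Take the first meta-info list that follows this anchor. If a video
        # had no such list, this would wrongly pick up the next video's list.
        meta = META_RE.search(html, anchor.end())
        if not (title and href and meta):
            continue
        items = LI_RE.findall(meta.group(1))
        # Assumption: the last <li> holds the (relative) upload date.
        upload_date = items[-1].strip() if items else ""
        results.append(("https://www.youtube.com" + href.group(1),
                        title.group(1),
                        upload_date))
    return results


videos = extract_videos(SAMPLE_HTML)
print(videos)
```

Matching each anchor first and then searching forward for the next meta-info list keeps the URL, title, and date of one video together, instead of building separate lists and pairing them by index.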

Extracting YouTube meta info using regular expressions

0 answers: