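I'm having a problem with my web scraper when trying to get a channel's title. I'm not sure how to fix it, but from some testing with the channel function, video links seem to work with it, when only channel links should work with the YoutubeChannel function.
Any ideas on how to fix this?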
# Required modules
import re
import sys
import urllib.request


# Defining the YouTube video function
def YoutubeVideo():
    # Making videoLink equal to whatever the user enters as their video link
    videoLink = input('\nWhat is your video link? (With http included)\n')
    # Goes to the video URL, opens it and reads the HTML file
    htmlfile = urllib.request.urlopen(videoLink)  # Opens this URL
    htmltext = htmlfile.read().decode('utf-8', errors='replace')  # Reads the HTML and decodes it to text
    # Setup for the view counter
    regexView = "<div class=\"watch-view-count\">(.+?)</div>"  # Pattern for the view count
    pattern = re.compile(regexView)
    viewCount = re.findall(pattern, htmltext)
    # Setup for the video title
    regexTitle = "<title>(.+?)</title>"  # Pattern for the title of the video
    patternTitle = re.compile(regexTitle)
    videoTitle = re.findall(patternTitle, htmltext)
    # Setup for the video upload date
    regexUpload = "<strong class=\"watch-time-text\">(.+?)</strong>"
    patternUpload = re.compile(regexUpload)
    videoUpload = re.findall(patternUpload, htmltext)
    print("\n%s" % (videoLink))  # Prints the video link, primarily for testing
    # Prints the information about the video (each value is the list returned by re.findall)
    print("\nThe title of your video is %s and has %s views.\nIt was %s." % (videoTitle, viewCount, videoUpload))


# Defining the YouTube channel function
def YoutubeChannel():
    # Making channelLink equal to whatever the user enters as their channel link
    channelLink = input('\nWhat is your channel link? (With http included)\n')
    # Goes to the channel URL, opens it and reads the HTML file
    htmlfile = urllib.request.urlopen(channelLink)  # Opens this URL
    htmltext = htmlfile.read().decode('utf-8', errors='replace')  # Reads the HTML and decodes it to text
    # Setup for the channel name
    regexChannelTitle = "<title>(.+?)</title>"  # Pattern for the page title (any page with a <title> matches)
    patternChannelTitle = re.compile(regexChannelTitle)
    channelTitle = re.findall(patternChannelTitle, htmltext)
    print(channelTitle)


ans = True
while ans:
    print("\n[1] Get information regarding a YouTube video.")
    print("\n[2] Get information regarding a YouTube channel.")
    print("\n[Q] Quit the application.")
    ans = input("\nWhat would you like to do now? ")
    if ans == "1":
        YoutubeVideo()
    elif ans == "2":
        YoutubeChannel()
    elif ans.lower() == "q":
        sys.exit(0)
    elif ans != "":
        print("Not a valid choice, try again.")
Answer 0 (score: 0)
I'm not familiar with what you are using to parse the HTML content, but you could use BeautifulSoup, which is easier.
import requests
from bs4 import BeautifulSoup

# channel url = https://www.youtube.com/channel/XXXXXX
url = "your channel link"
page = requests.get(url)
plain_text = page.text
soup = BeautifulSoup(plain_text, "html.parser")
# Inner span that holds the title text (not used below)
span = soup.find('span', {'class': 'qualified-channel-title-text'})
# Header link that wraps the whole title and carries it in its title attribute
title = soup.find('a', {'class': 'spf-link branded-page-header-title-link yt-uix-sessionlink'})
title = title.get('title')
print(title)
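As you can see, the HTML tag I'm using wraps the whole title, including the link and the small image. It has the title in its text and uses the class "spf-link branded-page-header-title-link yt-uix-sessionlink", and I then read the title attribute from it :)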
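To make the tag selection above concrete without fetching a live page, here is a minimal sketch that runs against a fixed HTML string. It uses the standard library's html.parser rather than BeautifulSoup, purely so it runs without extra installs. The SAMPLE_HTML markup, the TitleLinkParser class and the extract_channel_title helper are all made up for illustration; the real page markup may look different.

```python
from html.parser import HTMLParser

# Hypothetical markup imitating the header link described above.
# This is an assumption for illustration, not a copy of a real page.
SAMPLE_HTML = """
<div class="branded-page-header">
  <a class="spf-link branded-page-header-title-link yt-uix-sessionlink"
     href="/channel/XXXXXX" title="Example Channel">
    <img src="thumb.png" alt="">
    <span class="qualified-channel-title-text">Example Channel</span>
  </a>
</div>
"""


class TitleLinkParser(HTMLParser):
    """Records the title attribute of the first <a> that has a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only look at anchors, and stop once a title has been found.
        if tag != "a" or self.title is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes:
            self.title = attr_map.get("title")


def extract_channel_title(html, wanted_class="branded-page-header-title-link"):
    """Return the title attribute of the matching link, or None if absent."""
    parser = TitleLinkParser(wanted_class)
    parser.feed(html)
    return parser.title


if __name__ == "__main__":
    print(extract_channel_title(SAMPLE_HTML))
```

Matching on a single class name instead of the full class string is a deliberate choice in this sketch: it keeps working if the other classes on the link change. A page without such a link gives None, which is worth checking before printing.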
Hope this is useful.
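Note that if you want to run it, you will have to install beautifulsoup and requests; both can be installed easily with pip (pip install beautifulsoup4 requests).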
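On the original problem of a video link being accepted by YoutubeChannel: the function only looks for a <title> tag, and every page has one, so any link will "work". One possible approach is to check the shape of the URL before fetching it. The sketch below is only an illustration; the classify_youtube_url helper is hypothetical, and the host names and path prefixes it checks are my assumptions about common YouTube URL forms, not an exhaustive list.

```python
from urllib.parse import urlparse, parse_qs

# Assumed host names and channel path prefixes; adjust to the links you expect.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
CHANNEL_PREFIXES = ("/channel/", "/user/", "/c/", "/@")


def classify_youtube_url(url):
    """Return "video", "channel" or "unknown" based only on the URL's shape."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    # Short links such as https://youtu.be/<id> point at videos.
    if host == "youtu.be" and len(parts.path) > 1:
        return "video"
    if host not in YOUTUBE_HOSTS:
        return "unknown"
    # Regular video links use /watch with a non-empty v query parameter.
    if parts.path == "/watch" and "v" in parse_qs(parts.query):
        return "video"
    if parts.path.startswith(CHANNEL_PREFIXES):
        return "channel"
    return "unknown"


if __name__ == "__main__":
    for link in ("https://www.youtube.com/channel/XXXXXX",
                 "https://www.youtube.com/watch?v=abc123"):
        print(link, "->", classify_youtube_url(link))
```

In YoutubeChannel you could then call this on channelLink and print a message and return early unless the result is "channel". This only inspects the URL text; it does not confirm that the channel actually exists.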