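I need to extract the video links, together with their titles, from a YouTube playlist.
I used SelectorGadget (a Chrome extension) to find the CSS selector, but whenever I try to get anything out of BeautifulSoup it returns None, and I can't see what the problem is.
Here is the code I wrote: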
import re
import sys

import requests
from bs4 import BeautifulSoup

try:
    # checking url format (raw string so the backslashes reach the regex engine unchanged)
    url_pattern = re.compile(r"^(?:http|https|ftp):\/\/[a-zA-Z0-9_~:\-\/?#[\]@!$&'()*+,;=`^.%]+\.[a-zA-Z0-9_~:\-\/?#[\]@!$&'()*+,;=`^.%]+$")
    # playlist_url = input("Enter your youtube playlist url: ")
    # getting input directly from the command line
    playlist_url = sys.argv[1]
    if not url_pattern.match(playlist_url):
        raise ValueError("Enter valid link")
    # the function shown below must be defined before this block runs
    get_links_from_youtube_playlist(playlist_url)
except ValueError as value_error:
    print(value_error)
The URL is then passed to another function:
def get_links_from_youtube_playlist(youtube_playlist_url):
    request_response = requests.get(youtube_playlist_url)
    # using "html.parser" lib
    # soup_object = BeautifulSoup(request_response.text, 'html.parser')
    # using "lxml" - Processing XML and HTML with Python
    soup_object = BeautifulSoup(request_response.text, 'lxml')
    # not working?!
    url_list = soup_object.select("#video-title")
    print(url_list)
    # this is not working either?!
    div_content = soup_object.find("div", attrs={"class": "content"})
    print(div_content)
I run it with the following command:
python3 test.py https://www.youtube.com/playlist\?list\=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
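When I print the result after calling select or find, I get None (or, for select, an empty list). Since the ID exists in the page, shouldn't it find something?
SelectorGadget only shows me #video-title when I click that part of the page, yet I cannot even reach the enclosing div.
How should I extract the links and their titles?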
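As a side note on the URL check in the question, here is a minimal sketch that uses the standard library's urllib.parse instead of a hand-written regex. The helper name extract_playlist_id is hypothetical and not part of the original code; it only assumes that the playlist ID is carried in the list query parameter, as in the command shown above.

```python
from urllib.parse import urlparse, parse_qs


def extract_playlist_id(url):
    """Return the value of the `list` query parameter, or None if it is missing.

    Hypothetical helper: assumes an http(s) URL whose playlist ID is in `list`.
    """
    parsed = urlparse(url)
    # Reject anything that is not an absolute http(s) URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    values = parse_qs(parsed.query).get("list")
    return values[0] if values else None


playlist_id = extract_playlist_id(
    "https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab"
)
print(playlist_id)
```

This does not check that the host is actually YouTube; it only shows one way to reject malformed input before making a request.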
Answer 0 (score: 0)
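YouTube inspects the User-Agent header to decide which kind of page to return. If you send a user agent that matches a real browser, you get the response you expect. Also, in that response video-title is a class, not an ID, so change the selector to .video-title.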
import pprint

import requests
from bs4 import BeautifulSoup

pp = pprint.PrettyPrinter()


def get_links_from_youtube_playlist(youtube_playlist_url):
    # send a browser-like user agent so YouTube returns the browser version of the page
    request_response = requests.get(youtube_playlist_url, headers={"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"})
    soup_object = BeautifulSoup(request_response.text, 'lxml')
    url_list = soup_object.select(".video-title")
    pp.pprint(url_list)


get_links_from_youtube_playlist('https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab')
Output:
[<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>,
<div class="video-title text-shell skeleton-bg-color"></div>]
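Note that the matched elements in the output above are empty "skeleton" placeholders, which suggests the titles and links are filled in later by JavaScript rather than being present in the HTML that requests receives. The sketch below uses a small hand-written snippet (hypothetical markup loosely modeled on that output, not real YouTube HTML) to show how the class and ID selectors differ, what select and find return when nothing matches, and how titles and links could be read from anchors if they were present.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; this is not real YouTube HTML.
SAMPLE_HTML = """
<div class="content">
  <div class="video-title text-shell skeleton-bg-color"></div>
  <div class="video-title text-shell skeleton-bg-color"></div>
  <a id="video-title" href="/watch?v=abc123">Example video</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# ".video-title" matches elements by class, "#video-title" matches by id.
by_class = soup.select(".video-title")
by_id = soup.select("#video-title")

# When nothing matches, select() gives an empty list and find() gives None.
missing_select = soup.select("#no-such-id")
missing_find = soup.find("div", attrs={"class": "no-such-class"})

# Placeholders carry no text, so there is nothing to extract from them.
placeholders = [tag for tag in by_class if not tag.get_text(strip=True)]

# If real anchors were present, title and link could be read like this.
links = [(a.get_text(strip=True), a["href"]) for a in by_id]

print(len(by_class), len(by_id), missing_select, missing_find)
print(len(placeholders), links)
```

If the real page only contains such placeholders, a static parser alone is unlikely to recover the titles; the rendered page or the data that the page's scripts load would have to be inspected instead.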