Question

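I need to extract the video links, together with their titles, from a YouTube playlist. I used SelectorGadget (a Chrome extension) to find the CSS selector, but whenever I try to fetch anything with BeautifulSoup it returns None, and I cannot figure out where the problem is.

Here is the code I wrote: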

import sys
import requests
from bs4 import BeautifulSoup
import re

try:
    # check the URL format (raw string so the regex escapes are passed through as written)
    url_pattern = re.compile(r"^(?:http|https|ftp):\/\/[a-zA-Z0-9_~:\-\/?#[\]@!$&'()*+,;=`^.%]+\.[a-zA-Z0-9_~:\-\/?#[\]@!$&'()*+,;=`^.%]+$")

    # playlist_url = input("Enter your YouTube playlist URL: ")
    # read the URL directly from the command line
    playlist_url = sys.argv[1]

    if not url_pattern.match(playlist_url):
        raise ValueError("Enter valid link")

    get_links_from_youtube_playlist(playlist_url)

except ValueError as value_error:
    print(value_error)
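
As an aside, the same kind of check could be sketched with the standard library's urllib.parse instead of a hand-written regex. This is only an illustration: the helper name is_playlist_url is hypothetical, and the rule "an http(s) URL with a list query parameter" is an assumption about what counts as a valid playlist link, not something the code above enforces.

```python
from urllib.parse import urlparse, parse_qs


def is_playlist_url(url):
    """Hypothetical helper: True for an http(s) URL that carries a 'list' query parameter."""
    parts = urlparse(url)
    # Require a web scheme and a host name
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    # Assumption: a playlist link is identified by its 'list' query parameter
    return "list" in parse_qs(parts.query)


good = is_playlist_url("https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab")
bad = is_playlist_url("not a url")
print(good, bad)
```

Unlike the regex, this rejects a URL that is well formed but has no playlist ID in it.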

The URL is then passed to another function:


def get_links_from_youtube_playlist(youtube_playlist_url):

    request_response = requests.get(youtube_playlist_url)

    # using "html.parser" lib
    # soup_object = BeautifulSoup(request_response.text, 'html.parser')
    # using "lxml" - Processing XML and HTML with Python
    soup_object = BeautifulSoup(request_response.text, 'lxml')

    # not working?!
    url_list = soup_object.select("#video-title")
    print(url_list)
    # this is not working either?!
    div_content = soup_object.find("div", attrs={"class" : "content"})
    print(div_content)

I run it with the following command:

python3 test.py https://www.youtube.com/playlist\?list\=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

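When I print the result of the select or find method, my output is None. Since the ID exists in the page, shouldn't it find something meaningful?

SelectorGadget only shows me #video-title when I click that part of the page. Even if I cannot access the div, how should I extract the links and their titles?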

Answer 1

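YouTube checks the user agent to decide which kind of page to return. If you send a user agent that matches a real browser, you get the response you expect. Also, in the returned HTML video-title is a class, not an ID, so change the selector to .video-title.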
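
To make the user-agent point concrete, here is a minimal sketch, assuming the requests package is installed. It prints the User-Agent that requests sends when none is supplied, then builds (without sending) a request carrying the browser-like value used in this answer. The variable names are illustrative only.

```python
import requests

# User-Agent that requests sends when no header is supplied
default_ua = requests.utils.default_user_agent()
print(default_ua)  # a "python-requests/<version>" string, not a browser

# Browser-like value, the same one used in this answer
browser_ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
)

# Build the request without sending it, to inspect the header that would go out
prepared = requests.Request(
    "GET",
    "https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab",
    headers={"user-agent": browser_ua},
).prepare()
print(prepared.headers["User-Agent"])
```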

import pprint
from bs4 import BeautifulSoup
import requests

pp = pprint.PrettyPrinter()

def get_links_from_youtube_playlist(youtube_playlist_url):

    # send a browser-like User-Agent so YouTube returns the page a browser would get
    headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"}
    request_response = requests.get(youtube_playlist_url, headers=headers)

    soup_object = BeautifulSoup(request_response.text, 'lxml')
    url_list = soup_object.select(".video-title")
    pp.pprint(url_list)

get_links_from_youtube_playlist('https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab')

Output:

[<div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>]
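
The difference between the two selectors can be reproduced offline. The following is a minimal sketch, assuming bs4 is installed; it runs against a static stand-in for the markup printed above rather than a live YouTube response, and uses the built-in html.parser so lxml is not required.

```python
from bs4 import BeautifulSoup

# Static stand-in for the skeleton markup shown in the output above
html = """
<div class="video-title text-shell skeleton-bg-color"></div>
<div class="video-title text-shell skeleton-bg-color"></div>
"""

soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#video-title")     # ID selector: no element has id="video-title"
by_class = soup.select(".video-title")  # class selector: matches both divs
missing = soup.find("div", attrs={"class": "content"})  # no div with class "content"

print(by_id)          # select() yields an empty list when nothing matches
print(len(by_class))  # number of divs matched by the class selector
print(missing)        # find() yields None when nothing matches
print([div.get_text() for div in by_class])  # the placeholder divs carry no text
```

Note that the matched divs are empty placeholders: this markup contains neither titles nor links, so the class selector alone does not yield the playlist entries.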

Why does the BeautifulSoup select method return None?
