Trouble scraping a web page with Beautiful Soup

Time: 2021-06-06 18:35:12

Tags: python python-3.x web-scraping beautifulsoup

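I'm having trouble scraping this web page: top-programming-guru
I want to retrieve a list of all the YouTube channels listed on the page.
I'm using BeautifulSoup. I looked at the page source and then tried the following code: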

import requests
from bs4 import BeautifulSoup

URL = 'https://noonies.tech/award/top-programming-guru'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'lxml')

results = soup.find_all('div', class_='sc-jhAzac dldLgq')
results

But I always get an empty list.
Any ideas on how to do this correctly?

This is the tag I'm looking for:

<div class="sc-jhAzac dldLgq">
 <p>
  ?
  <em>And the winner is...</em>
 </p>
<div class="sc-gZMcBi kTYIfA">
 <div class="nomination-info">
  <h3><i class="fad fa-trophy"></i><a href="https://www.youtube.com/c/programmingwithmosh/videos" target="_blank">Programming with Mosh</a></h3>

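1 answer:

Answer 0 (score: 1)

The data is loaded dynamically, so it is not in the HTML that requests receives. Use Selenium or a similar tool to let the JavaScript run first, then scrape the rendered page.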
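Before reaching for a browser, one quick way to confirm that the content really is injected by JavaScript is to count how often the target classes occur in the static HTML versus the rendered HTML. The following is only a sketch using the standard library: `count_class_matches`, `static_html`, and `rendered_html` are hypothetical names, and the two HTML strings are made-up stand-ins for what `requests.get(URL).text` and `driver.page_source` might return, not the real page.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains every wanted class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted.split())
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # A valueless class attribute is reported as None, hence the `or ""`.
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.count += 1


def count_class_matches(html, class_names):
    """Return how many tags in `html` carry all of `class_names`."""
    parser = ClassCounter(class_names)
    parser.feed(html)
    return parser.count


# Made-up stand-ins (assumptions): an empty app shell vs. the rendered markup.
static_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
rendered_html = (
    '<html><body><div id="root">'
    '<div class="sc-jhAzac dldLgq"><p>And the winner is...</p></div>'
    '</div></body></html>'
)

print(count_class_matches(static_html, "sc-jhAzac dldLgq"))    # 0
print(count_class_matches(rendered_html, "sc-jhAzac dldLgq"))  # 1
```

A count of zero on the raw response, combined with the element being visible in the browser's inspector, suggests the markup is built client-side, which is the situation the Selenium script below is meant to handle.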

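import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without opening a window
chrome_options = Options()
chrome_options.add_argument("--headless")

url = 'https://noonies.tech/award/top-programming-guru'
# Path to the ChromeDriver binary; newer Selenium releases expect a Service object instead
driver = webdriver.Chrome('chromedriver.exe', options=chrome_options)
driver.get(url)
time.sleep(5)  # give the JavaScript time to render the page
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "html.parser")
# Every tag whose href contains youtube.com
soup.find_all(href=re.compile('youtube.com'))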

The output is a list of the tags whose href contains youtube.com. If the list picks up youtube.com links you don't want, you may need to clean it up or narrow the search.

[<a href="https://www.youtube.com/c/programmingwithmosh/videos" target="_blank">Programming with Mosh</a>,
 <a href="https://www.youtube.com/user/TechGuyWeb" target="_blank">Traversy Media</a>,
 <a href="https://www.youtube.com/user/schafer5" target="_blank">Corey Schafer</a>,
 <a href="https://m.youtube.com/channel/UC4JX40jDee_tINbkjycV4Sg" target="_blank">Tech With Tim</a>,
 <a href="https://www.youtube.com/user/krishnaik06/playlists" target="_blank">Krish Naik</a>,
 <a href="https://www.youtube.com/channel/UC8butISFwT-Wl7EV0hUK0BQ" target="_blank">freeCodeCamp.org</a>,
 <a href="https://www.youtube.com/c/HiteshChoudharydotcom" target="_blank">Hitesh Choudhary</a>,
 <a href="https://m.youtube.com/cleverprogrammer?uid=qrILQNl5Ed9Dz6CGMyvMTQ" target="_blank">Clever Programmer</a>,
 <a href="https://www.youtube.com/user/CalebTheVideoMaker2" target="_blank">Caleb Curry</a>,....
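The cleanup mentioned above could look something like the sketch below. It assumes the tags have already been turned into `(name, href)` pairs, for example with `[(a.get_text(strip=True), a["href"]) for a in soup.find_all(href=re.compile('youtube.com'))]`. The function name `clean_channel_links` and the last sample entry are hypothetical, added only for illustration.

```python
from urllib.parse import urlparse


def clean_channel_links(pairs):
    """Keep links hosted on youtube.com (or a subdomain) and drop duplicates."""
    seen = set()
    cleaned = []
    for name, href in pairs:
        host = (urlparse(href).hostname or "").lower()
        # The regex 'youtube.com' matches anywhere in the URL, so check the host.
        if host != "youtube.com" and not host.endswith(".youtube.com"):
            continue
        # Skip hrefs that were already collected, keeping first-seen order.
        if href in seen:
            continue
        seen.add(href)
        cleaned.append((name, href))
    return cleaned


sample = [
    ("Programming with Mosh", "https://www.youtube.com/c/programmingwithmosh/videos"),
    ("Tech With Tim", "https://m.youtube.com/channel/UC4JX40jDee_tINbkjycV4Sg"),
    # Duplicate of the first entry
    ("Programming with Mosh", "https://www.youtube.com/c/programmingwithmosh/videos"),
    # Hypothetical link that matches the regex but is not hosted on YouTube
    ("Not a channel", "https://example.com/?ref=youtube.com"),
]

for name, href in clean_channel_links(sample):
    print(name, href)
```

Checking the parsed hostname rather than the raw string is a design choice: it keeps `www.` and `m.` variants while rejecting URLs that merely mention youtube.com in a path or query string.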