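Using Beautiful Soup and the requests library I can scrape the HTML content, but not the content that is loaded through JavaScript or AJAX calls.
How can I mimic this from my Python script? YouTube comments are loaded as we scroll the page. I found 2 approaches: one uses Selenium and the other uses lxml with requests, which I could not understand.
Example (this is the video):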
import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.youtube.com/watch?v=iFPMz36std4'
response = requests.get(url)
page_html = response.content
# print(page_html)
page_soup = soup(page_html, "html.parser")
print(page_soup)
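Answer 0: (score: -1)
You need to use Selenium:
Here is the trick: YouTube only loads the comments once you scroll down a little below the video. If you jump straight to the bottom or somewhere else, the comments will not load. So first scroll down to that section, wait for the comments to load, and then scroll to the bottom or anywhere else: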
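from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')
# Scroll down a little so the comment section starts loading
driver.execute_script('window.scrollTo(1, 500);')
# Now wait to let the comments load
time.sleep(5)
driver.execute_script('window.scrollTo(1, 3000);')
# find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
comment_div = driver.find_element(By.XPATH, '//*[@id="contents"]')
# Note: an XPath starting with '//' searches the whole document, not only comment_div
comments = comment_div.find_elements(By.XPATH, '//*[@id="content-text"]')
for comment in comments:
    print(comment.text)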
Part of the output:
# can't post the full output, it's too long
I love Kygo's Stranger Things and Netflix's Stranger Things <3
Stranger Things, Kygo and OneRepublic, could it be better?
Amazing Vibe!!!!!!!!!
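A fixed time.sleep(5) can be too short on a slow connection and wasteful on a fast one. As a minimal sketch, the wait could instead poll the page until at least one comment node appears. The helper name wait_for_comments and its parameters are hypothetical. It only assumes that the driver exposes execute_script, as Selenium's WebDriver does, and that comments still carry the id content-text used in the XPath above.

```python
import time


def wait_for_comments(driver, min_count=1, timeout=15.0, poll=0.5):
    """Poll until at least min_count comment nodes exist or the timeout expires.

    Returns the number of comment nodes seen on the last poll.
    """
    # Count nodes by id; the id "content-text" is an assumption about YouTube's markup.
    script = "return document.querySelectorAll('[id=\"content-text\"]').length"
    deadline = time.monotonic() + timeout
    while True:
        count = driver.execute_script(script)
        # Stop as soon as enough comments are present, or give up at the deadline.
        if count >= min_count or time.monotonic() >= deadline:
            return count
        time.sleep(poll)
```

With this in place, the time.sleep(5) line could be replaced by wait_for_comments(driver), and a return value of 0 would signal that no comments showed up in time. Selenium also ships its own WebDriverWait class for explicit waits, which may be preferable in a larger script.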
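Answer 1: (score: -1)
I have a different way of scrolling down, though. This function scrolls down by calling JavaScript periodically and checks whether the page height changed between scrolls; it stops once the height no longer changes.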
import time

from bs4 import BeautifulSoup
from selenium import webdriver


def scrollDown(pause, driver):
    """
    Function to scroll down till end of page.
    """
    lastHeight = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to load more content
        time.sleep(pause)
        newHeight = driver.execute_script("return document.body.scrollHeight")
        # Stop once scrolling no longer increases the page height
        if newHeight == lastHeight:
            break
        lastHeight = newHeight


# Main Code
driver = webdriver.Chrome()
# Instantiate browser and navigate to page
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')
scrollDown(6, driver)
# Page soup
soup = BeautifulSoup(driver.page_source, "html.parser")
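On a video with a very large number of comments, a loop that runs until the height stops changing can take a long time. One possible variation, sketched here under the same assumptions as the function above, caps the number of scrolls. The name scroll_down_bounded and the max_scrolls parameter are hypothetical and not part of Selenium; the driver only needs an execute_script method.

```python
import time


def scroll_down_bounded(driver, pause=2.0, max_scrolls=20):
    """Scroll to the bottom repeatedly.

    Stops when the page height stops growing or after max_scrolls scrolls.
    Returns the number of scrolls performed.
    """
    height_script = "return document.body.scrollHeight"
    last_height = driver.execute_script(height_script)
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        # Give the page time to load more content before measuring again.
        time.sleep(pause)
        new_height = driver.execute_script(height_script)
        if new_height == last_height:
            break
        last_height = new_height
    return scrolls
```

Calling scroll_down_bounded(driver, pause=6, max_scrolls=10) would behave like the original function but give up after ten scrolls, which trades completeness for a predictable running time.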