Scraping dynamic HTML (YouTube comments)

Asked: 2017-10-31 16:10:52

Tags: python web-scraping beautifulsoup python-requests dynamic-html

Using the Beautiful Soup and requests libraries I am able to scrape static HTML content, but not content that is loaded via JavaScript or AJAX calls.

How do I mimic that in my Python script? YouTube comments are loaded only as you scroll down the page. I found two approaches: one uses Selenium, and the other uses lxml with requests, which I could not follow.

Example (this is the video):

import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.youtube.com/watch?v=iFPMz36std4'
response = requests.get(url)
page_html = response.content
# print(page_html)

page_soup = soup(page_html, "html.parser")
print(page_soup)
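
You can verify that the dynamically loaded comments never show up in this static response. A minimal check, assuming the comment text nodes carry the id content-text (the locator used in the answers below):

import requests
from bs4 import BeautifulSoup

url = 'https://www.youtube.com/watch?v=iFPMz36std4'
page_soup = BeautifulSoup(requests.get(url).content, "html.parser")

# The comment nodes are injected client-side by JavaScript, so they
# are absent from the raw HTML that requests fetches.
print(len(page_soup.find_all(attrs={"id": "content-text"})))  # expected: 0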

2 Answers:

Answer 0 (score: -1):

You need to use Selenium:

Here is the trick: YouTube only loads the comments once you scroll down past the video. If you jump straight to the bottom or anywhere else, the comments will not load. So first scroll to the section just below the video, wait for the comments to load, and then scroll to the bottom or wherever you like:

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')

# Scroll to just below the video so the comment section starts loading.
driver.execute_script('window.scrollTo(1, 500);')

# Wait for the comments to load.
time.sleep(5)

# Now it is safe to scroll further down.
driver.execute_script('window.scrollTo(1, 3000);')

comment_div = driver.find_element_by_xpath('//*[@id="contents"]')
comments = comment_div.find_elements_by_xpath('//*[@id="content-text"]')
for comment in comments:
    print(comment.text)

Part of the output:

# can't post the full output, it's too long
I love Kygo's Stranger Things and Netflix's Stranger Things <3
Stranger Things, Kygo and OneRepublic, could it be better?
Amazing Vibe!!!!!!!!!
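
As a variant, the fixed time.sleep(5) can be replaced with an explicit wait that blocks until the comment nodes actually exist. A sketch under the same assumptions (same URL, and the //*[@id="content-text"] locator from the answer above):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')
driver.execute_script('window.scrollTo(1, 500);')

# Block until at least one comment node is present (up to 15 seconds),
# instead of sleeping for a fixed interval.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="content-text"]'))
)

for comment in driver.find_elements(By.XPATH, '//*[@id="content-text"]'):
    print(comment.text)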

Answer 1 (score: -1):

Using Selenium will solve the problem.

I scroll down in a different way, though. The function below scrolls to the bottom of the page by calling JavaScript repeatedly and checking, between scrolls, whether the page height has stopped changing.

import time
from selenium import webdriver
from bs4 import BeautifulSoup

def scrollDown(pause, driver):
    """
    Scroll down until the end of the page is reached.
    """
    lastHeight = driver.execute_script("return document.body.scrollHeight")

    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        newHeight = driver.execute_script("return document.body.scrollHeight")
        if newHeight == lastHeight:
            # The page height stopped growing, so we reached the bottom.
            break
        lastHeight = newHeight

# Main code

# Instantiate the browser and navigate to the page
driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=iFPMz36std4')
scrollDown(6, driver)

# Parse the fully loaded page source
soup = BeautifulSoup(driver.page_source, "html.parser")
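
To actually print the comments from that soup, a minimal follow-up, assuming (as in the first answer) that the comment text sits in nodes with the id content-text:

# Pull the text of every loaded comment out of the parsed page source.
for node in soup.find_all(attrs={"id": "content-text"}):
    print(node.get_text(strip=True))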