Scraping all YouTube comments and their replies with Selenium in Python

Time: 2019-05-30 12:05:50

Tags: python selenium web-scraping youtube selenium-chromedriver

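I am trying to scrape the comments on a YouTube video together with their replies, and for each comment its likes, dislikes, and reply count, plus the overall comment count.

As a first step, I tried to scrape the text data (the comments and their replies) by element ID, using Selenium with ChromeDriver in Python.

I can only scrape the comments that are visible on the page, not the replies.

I have not been able to get the replies at all.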

<target name="A" xsi:type="Mail" from="" to="" subject="" smtpServer="" smtpPort="0" skipCertificateValidation="true">
      <layout xsi:type="JsonLayout" includeAllProperties="true">
        <attribute name="text" layout="${message}" />
        <attribute name="level" layout="${level:upperCase=true}"/>
        <attribute name="fileName" layout="${var:fileName}"/>
        <attribute name="logGroupName" layout="${var:logGroupName}"/>
        <attribute name="logStreamName" layout="${var:logStreamName}"/>
        <attribute name="category" layout="${logger}" />
        <attribute name="exception" layout="${exception:format=@}" encode="false"/>
      </layout>
    </target>

<target name="B" xsi:type="Mail" from="" to="" subject="" smtpServer="" smtpPort="0" skipCertificateValidation="true">
  <layout xsi:type="JsonLayout" includeAllProperties="true">
    <attribute name="text" layout="${message}" />
    <attribute name="level" layout="${level:upperCase=true}"/>
    <attribute name="fileName" layout="${var:fileName}"/>
    <attribute name="logGroupName" layout="${var:logGroupName}"/>
    <attribute name="logStreamName" layout="${var:logStreamName}"/>
    <attribute name="category" layout="${logger}" />
    <attribute name="exception" layout="${exception:format=@}" encode="false"/>
  </layout>
</target>

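With the code above I can only scrape the comments. How can I use Selenium in Python to scrape the replies, likes, dislikes, and dates of those comments?

Can anyone help by pointing out where I am going wrong?

Updated code (empty array)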

 // Label align for Y-axis
$graph->yaxis->SetLabelAlign('center','bottom');
// Titles
// @aici
$graph->title->Set('Difference');
$graph->title->SetFont(FF_ARIAL, FS_BOLD, 14);

// Create a bar plot
$bplot = new BarPlot($yAxis);
//$bplot->SetFillColor('orange');
    foreach ($yAxis as $datayvalue) {
    if ($datayvalue < '0') $barcolors[]='yellow';
    elseif ($datayvalue >= '0' ) $barcolors[]='blue';

}

$bplot->SetFillColor($barcolors);
$bplot->SetWidth(0.5);
$bplot->SetYMin(100);
$bplot->value->SetFont(FF_ARIAL, FS_NORMAL, 10.5);
$bplot->SetWeight(0);
//$bplot->numpoints = 1;

$graph->Add($bplot);

My updated code (1-05-2019):

import time
import csv
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chrome_path = "/Users/Downloads/chromedriver"
page_url = "https://www.youtube.com/watch?v=AJesAlohO6I&t=" 


driver = webdriver.Chrome(executable_path=chrome_path)
driver.get(page_url)
time.sleep(2)  


title = driver.find_element_by_xpath('//*[@id="container"]/h1/yt-formatted-string').text
print(title)


SCROLL_PAUSE_TIME = 2
CYCLES = 100

html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.PAGE_DOWN)  
html.send_keys(Keys.PAGE_DOWN)  
time.sleep(SCROLL_PAUSE_TIME * 3)

for i in range(CYCLES):
    html.send_keys(Keys.END)
    time.sleep(SCROLL_PAUSE_TIME)


comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
all_comments = [elem.text for elem in comment_elems]
print(all_comments)

write_file = "output_testing.csv"
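with open(write_file, "w", newline="", encoding="utf-8") as output:
    writer = csv.writer(output)  # quotes commas and line breaks inside a comment
    for line in all_comments:
        writer.writerow([line])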

My actual output:

import time
import csv
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chrome_path = "/Users/Downloads/chromedriver"
page_url = "https://www.youtube.com/watch?v=qBp1rCz_yQU" 


driver = webdriver.Chrome(executable_path=chrome_path)
driver.get(page_url)
time.sleep(2)  


title = driver.find_element_by_xpath('//*[@id="container"]/h1/yt-formatted-string').text
print(title)


SCROLL_PAUSE_TIME = 2
CYCLES = 100

html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.PAGE_DOWN)  
html.send_keys(Keys.PAGE_DOWN)  
time.sleep(SCROLL_PAUSE_TIME * 3)

for i in range(CYCLES):
    html.send_keys(Keys.END)
    time.sleep(SCROLL_PAUSE_TIME)

driver.find_elements_by_xpath('//div[@id="replies"]/ytd-comment-replies-renderer/ytd-expander/paper-button[@id="more"]')

comment_elems = driver.find_elements_by_xpath('//div[@id="loaded-replies"]//yt-formatted-string[@id="content-text"]')
all_comments = [elem.text for elem in comment_elems]
print(all_comments)

write_file = "output_31may.csv"
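with open(write_file, "w", newline="", encoding="utf-8") as output:
    writer = csv.writer(output)  # quotes commas and line breaks inside a comment
    for line in all_comments:
        writer.writerow([line])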

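My expected output is the text of the replies, but I am only getting the reply count.

1 answer:

Answer 0 (score: 0)

You need to click "View replies" before the replies to a comment can be scraped.

To click that button, you can do the following: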

for button in driver.find_elements_by_xpath('//div[@id="replies"]/ytd-comment-replies-renderer/ytd-expander/paper-button[@id="more"]'):
    button.click()  # find_elements_by_xpath returns a list, so click each button in turn
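
Because `find_elements_by_xpath` returns a list, every matching button has to be clicked on its own. Here is a minimal sketch of that step, assuming `driver` is the Selenium WebDriver from the question and that the XPath above still matches the page. `click_all` is a hypothetical helper name. The JavaScript click is a swapped-in workaround for buttons that are off-screen or covered by another element, not part of the answer itself.

```python
REPLY_BUTTON_XPATH = (
    '//div[@id="replies"]/ytd-comment-replies-renderer'
    '/ytd-expander/paper-button[@id="more"]'
)


def click_all(driver, xpath=REPLY_BUTTON_XPATH):
    """Click every element matched by xpath; return how many clicks succeeded."""
    clicked = 0
    # find_elements("xpath", ...) is the generic form of find_elements_by_xpath.
    for button in driver.find_elements("xpath", xpath):
        try:
            # Clicking through JavaScript avoids "element not clickable" errors
            # when the button is off-screen or overlapped.
            driver.execute_script("arguments[0].click();", button)
            clicked += 1
        except Exception:
            # Assumption: a button that went stale is skipped instead of
            # aborting the whole run.
            continue
    return clicked
```

Calling `click_all(driver)` after the scrolling loop, followed by a short `time.sleep`, should give the replies time to load before they are read.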

Then scrape the replies:

driver.find_elements_by_xpath('//div[@id="loaded-replies"]//yt-formatted-string[@id="content-text"]')
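
The XPath above returns every reply on the page as one flat list. To keep each reply attached to the comment it belongs to, the lookup can be done per thread instead. The sketch below assumes the replies have already been expanded, that each thread is a `ytd-comment-thread-renderer` element, and that the first `content-text` node in a thread is the top-level comment. Those two selector details are assumptions about the page markup and are not taken from the answer. `collect_threads` and `write_threads` are hypothetical helper names.

```python
import csv

# Assumption: one <ytd-comment-thread-renderer> per top-level comment.
THREAD_XPATH = "//ytd-comment-thread-renderer"
# Relative XPaths, evaluated inside a single thread element.
TEXT_XPATH = './/yt-formatted-string[@id="content-text"]'
REPLY_XPATH = './/div[@id="loaded-replies"]//yt-formatted-string[@id="content-text"]'


def collect_threads(driver):
    """Return a list of (comment_text, [reply_text, ...]) tuples."""
    threads = []
    for thread in driver.find_elements("xpath", THREAD_XPATH):
        texts = thread.find_elements("xpath", TEXT_XPATH)
        if not texts:
            continue  # thread not rendered yet
        replies = thread.find_elements("xpath", REPLY_XPATH)
        # Assumption: the first text node in document order is the comment itself.
        threads.append((texts[0].text, [reply.text for reply in replies]))
    return threads


def write_threads(threads, path):
    """Write one CSV row per comment or reply, with a column saying which it is."""
    with open(path, "w", newline="", encoding="utf-8") as output:
        writer = csv.writer(output)
        writer.writerow(["kind", "text"])
        for comment, replies in threads:
            writer.writerow(["comment", comment])
            for reply in replies:
                writer.writerow(["reply", reply])
```

With a live session this would be used as `write_threads(collect_threads(driver), "output.csv")`. Threads whose replies were never expanded simply come back with an empty reply list.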