I'm trying to scrape the number of likes on each comment on YouTube.
My overall code loops and scrolls down the page, but for simplicity I'm only showing the part that's giving me trouble. I'm new to web scraping. Here's what I've tried:
page_url = "https://www.youtube.com/watch?v=TQG7m1BFeRc"
driver = webdriver.Chrome('C:/Users/Me/Chrome Web Driver/chromedriver.exe')
driver.get(page_url)
html_source = driver.page_source
html = driver.find_element_by_tag_name('html')
soup = bs(html.text, 'html.parser')
soup_source = bs(html_source, 'html.parser')
Then I try to extract the like counts:
for div in soup.find_all('div', class_="style-scope ytd-comment-action-buttons-renderer"):
    a = str(div.text)
    print(a)
But this returns nothing. When I inspect the contents of soup_source, I can see the information I want to scrape:
<span aria-label="473 likes" class="style-scope ytd-comment-action-buttons-renderer" hidden="" id="vote-count-left">
473
etc.
I've also tried things along the lines of:
html = driver.(By.ID, 'vote-count-left')
but that doesn't work. Any help would be much appreciated. Thanks.
Answer 0 (score: 1)
This will work:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
driver_path = r'C:/Users/Me/Chrome Web Driver/chromedriver.exe'
page_url = "https://www.youtube.com/watch?v=TQG7m1BFeRc"
driver = webdriver.Chrome(driver_path)
driver.get(page_url)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="owner-name"]/a')))
driver.execute_script('window.scrollTo(0, 768);')
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'vote-count-left')))
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
result = [element.text.strip() for element in soup.find_all('span', {'id': 'vote-count-left'})]
result
Output:
['1.9K', '153', '36', '340', '474', '1.5K', '296', '750', '0', '18', '2K', '20', '17', '8', '192', '459', '56', '10', '0', '19']
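Note that YouTube abbreviates larger counts ('1.9K', '2K'), so if you need actual numbers you have to convert them yourself. Here's a minimal helper sketch; it assumes YouTube only uses the K (thousand) and M (million) suffixes, which is all I've seen in practice:

```python
def parse_count(text):
    """Convert an abbreviated YouTube count such as '1.9K' to an int.

    Assumes only the K (thousand) and M (million) suffixes occur.
    """
    text = text.strip()
    multipliers = {'K': 1_000, 'M': 1_000_000}
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

counts = [parse_count(t) for t in ['1.9K', '153', '0', '2K']]
print(counts)  # [1900, 153, 0, 2000]
```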
This is actually trickier than it looks at first, because YouTube doesn't load the comment section until you actually scroll down. So I had to add logic to wait for the page to fully load, scroll down, and then wait some more until the comments themselves had loaded.
Also, you should have been searching for span rather than div; that's why your original query found nothing.
Answer 1 (score: 0)
Get all the spans with the ID #vote-count-middle, read their aria-label attribute, which contains the like count, and use a regular expression to extract just the number.
Note: this code has not been tested, but it gives you a clear path toward what you're trying to achieve.
import re

# Match the leading count in a label like "473 likes" (e.g. "473" or "1.9K")
reg = re.compile(r'^[\d,.]+[KM]?')
likeArray = driver.find_elements_by_xpath('//*[@id="vote-count-middle"]')
for row in likeArray:
    # The aria-label attribute holds a plain string such as "473 likes"
    value = row.get_attribute("aria-label")
    if value:
        result = reg.search(value)
        if result:
            # Print just the number, with the "likes" text stripped
            print(result.group())
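You can check the regex idea without a browser by running it against sample aria-label strings. The pattern below matches a leading count with an optional K/M suffix; the sample strings mimic the "473 likes" format quoted in the question, and I haven't verified every label variant YouTube emits:

```python
import re

# Pull the leading count out of strings shaped like "473 likes"
reg = re.compile(r'^[\d,.]+[KM]?')

labels = ["473 likes", "1.9K likes", "0 likes"]
numbers = [reg.search(s).group() for s in labels]
print(numbers)  # ['473', '1.9K', '0']
```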
Answer 2 (score: 0)
How about this?
from bs4 import BeautifulSoup

html = """
<span id="vote-count-left" class="style-scope ytd-comment-action-buttons-renderer" aria-label="474 likes" hidden="">
474
</span>
"""

soup = BeautifulSoup(html, "lxml")
data = soup.find_all("span")
for i in data:
    print(i.text)
Output:
474
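One caveat: on a real watch page, find_all("span") will also match plenty of unrelated spans, so it's safer to filter on the id. A small sketch under the same idea (the unrelated span here is made up to show the filtering; YouTube repeats id="vote-count-left" across comments, so find_all with the id filter returns every match):

```python
from bs4 import BeautifulSoup

html = """
<div><span id="vote-count-left" aria-label="474 likes" hidden="">474</span>
<span class="other">unrelated</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
# The id filter skips spans that a bare find_all("span") would also return
counts = [s.text.strip() for s in soup.find_all("span", id="vote-count-left")]
print(counts)  # ['474']
```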