Scraping YouTube from multiple URLs

Date: 2020-08-16 02:32:16

Tags: python

I have a script that works for a single URL and saves the results to a file. I want to scrape from multiple URLs and save everything into one file. Please help.

"""
Main script for scraping the comments of any Youtube video.

Example:
    $ python main.py YOUTUBE_VIDEO_URL
"""

from selenium import webdriver
from selenium.common import exceptions

import sys
import time
import pandas as pd


def scrape(url):
    """Extracts the comments from the Youtube video given by the URL.

    Args:
        url (str): The URL of the Youtube video.

    Raises:
        selenium.common.exceptions.NoSuchElementException:
            When certain elements to look for cannot be found.
    """
    url = "https://www.youtube.com/watch?v=9hDe2kbCI4g&list=PLzivuVVbLcnqDasWGJSCg2euVWlpSf4S0&index=3"

    # Note: replace argument with absolute path to the driver executable.
    # driver = webdriver.Chrome('C:\webdrivers\chromedriver')
    driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")

    # Navigates to the URL, maximizes the current window, and
    # then suspends execution for (at least) 5 seconds (this
    # gives time for the page to load).
    driver.get(url)
    driver.maximize_window()
    time.sleep(5)

    try:
        # Extract the elements storing the video title and
        # comment section.
        title = driver.find_element_by_xpath('//*[@id="container"]/h1/yt-formatted-string').text
        comment_section = driver.find_element_by_xpath('//*[@id="comments"]')
    except exceptions.NoSuchElementException:
        # Note: Youtube may have changed its HTML layout for
        # videos, so raise an error for sanity's sake in case the
        # elements provided cannot be found anymore.
        error = "Error: Double check selector OR "
        error += "element may not yet be on the screen at the time of the find operation"
        print(error)

    # Scroll the comment section into view, then allow some time
    # for everything to be loaded as necessary.
    driver.execute_script("arguments[0].scrollIntoView();", comment_section)
    time.sleep(7)

    # Scroll all the way down to the bottom in order to get all the
    # elements loaded (since Youtube dynamically loads them).
    last_height = driver.execute_script("return document.documentElement.scrollHeight")

    while True:
        # Scroll down 'til "next load".
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")

        # Wait to load everything thus far.
        time.sleep(2)

        # Calculate new scroll height and compare with last scroll height.
        new_height = driver.execute_script("return document.documentElement.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # One last scroll just in case.
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")

    try:
        # Extract the elements storing the usernames and comments.
        username_elems = driver.find_elements_by_xpath('//*[@id="author-text"]')
        comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
    except exceptions.NoSuchElementException:
        error = "Error: Double check selector OR "
        error += "element may not yet be on the screen at the time of the find operation"
        print(error)

    print("> VIDEO TITLE: " + title + "\n")
    # print("> USERNAMES & COMMENTS:")

    # for username, comment in zip(username_elems, comment_elems):
    #     print(username.text + ":")
    #     print(comment.text + "\n")

    df = pd.DataFrame(columns=["Text", "Comment"])
    for username, comment in zip(username_elems, comment_elems):
        df = df.append({"Text": username.text, "Comment": comment.text}, ignore_index=True)
    print("> SAVING THE DATA TO CSV FILE:\n")
    filename = "10-G-3.csv"
    df.to_csv(filename)  # to save into existing csv
    print("> SAVED SUCCESSFULLY: " + filename + "\n")

    driver.close()

if __name__ == "__main__":
    scrape(sys.argv[1])
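One way to extend the single-URL script to multiple URLs is to have `scrape(url)` return its DataFrame instead of writing a CSV itself, then concatenate the per-video frames and write a single file. The sketch below is a minimal, hypothetical outline of that idea: `scrape_all`, `fake_scrape`, `video_urls`, and `all_comments.csv` are illustrative names, and `fake_scrape` merely stands in for the real Selenium-based scraper. Note that the hard-coded `url = "https://..."` assignment at the top of `scrape()` would also have to be removed so the parameter is actually used.

```python
import pandas as pd


def scrape_all(urls, scrape):
    """Run `scrape` (which must return a DataFrame) on each URL
    and combine every per-video result into a single DataFrame."""
    frames = [scrape(url) for url in urls]
    return pd.concat(frames, ignore_index=True)


# Stand-in for the real Selenium-based scrape(); it returns the
# same two columns the original script builds.
def fake_scrape(url):
    return pd.DataFrame({"Text": ["user"], "Comment": ["comment from " + url]})


video_urls = [
    "https://www.youtube.com/watch?v=AAAAAAAAAAA",
    "https://www.youtube.com/watch?v=BBBBBBBBBBB",
]

combined = scrape_all(video_urls, fake_scrape)
combined.to_csv("all_comments.csv", index=False)  # one file for every URL
```

With the real scraper plugged in, the loop visits each video in turn, so the total runtime grows with the number of URLs (each video still needs its own page-load and scroll delays).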

0 Answers:

There are no answers yet.