Scraping all clip links after scrolling down

Date: 2019-12-29 23:08:48

Tags: python selenium selenium-webdriver web-scraping

I am using Jupyter to scrape clip links from the gfycat platform, but I have been struggling to get all of them: the page has to be scrolled down continuously for further clips to appear.

I tried the following code (it runs, but it only returns the clips from the initial page plus one scroll down):

from selenium import webdriver
import time

driver = webdriver.Chrome('C:\\chromedriver.exe')
driver.get("https://gfycat.com/sound-gifs/search/%23funny")

last_link = ""
links = []
while True:
    # Grab every clip anchor currently rendered in the grid.
    elements = driver.find_elements_by_xpath("//div[@class='m-grid-container']//div[@class='grid-gfy-item']/a[@href]")
    last_element = elements[-1]
    for element in elements:
        href = element.get_attribute("href")
        print(href)
        if href not in links:  # skip anchors already collected on earlier passes
            links.append(href)

    # Scroll the last clip into view to trigger loading of the next batch.
    driver.execute_script("arguments[0].scrollIntoView();", last_element)
    time.sleep(10)

    # Stop once scrolling no longer changes the last anchor on the page.
    actual_last_link = last_element.get_attribute("href")
    if last_link == actual_last_link:
        print("done..")
        break
    else:
        last_link = actual_last_link

I also tried another piece of code, but it did not work well either:

from selenium import webdriver
import time

driver = webdriver.Chrome('C:\\chromedriver.exe')
driver.get("https://gfycat.com/sound-gifs/search/%23funny")

def getElements():
    return driver.find_elements_by_xpath("//div[@class='m-grid-container']//div[@class='grid-gfy-item']/a[@href]")

def scrollIntoElementView(element):
    driver.execute_script("arguments[0].scrollIntoView();", element)
    time.sleep(3)

def fetchLinks():
    links = set()
    retries = 0  # iterative retry counter; the earlier recursive call reset `links` on every invocation

    while True:
        links_count = len(links)
        elements = getElements()
        for element in elements:
            links.add(element.get_attribute("href"))

        # Scroll the last clip into view to trigger loading of the next batch.
        scrollIntoElementView(elements[-1])

        if links_count == len(links):
            # No new links appeared; retry a few more scrolls before giving up.
            if retries <= 10:
                retries += 1
            else:
                break
        else:
            retries = 0
    print("All links:", links)

fetchLinks()

I need help collecting all the clip links.
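
Would the standard window-scroll pattern work here instead of scrollIntoView? A minimal sketch of what I have in mind, assuming the XPath above still matches the grid and that the lazy loader is triggered by window scroll events; the anchors are harvested inside the loop in case gfycat drops off-screen items from the DOM:

from selenium import webdriver
import time

driver = webdriver.Chrome('C:\\chromedriver.exe')
driver.get("https://gfycat.com/sound-gifs/search/%23funny")

SCROLL_PAUSE = 3  # assumption: long enough for the next batch of clips to load
links = set()
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Harvest every anchor currently in the DOM before scrolling further.
    for element in driver.find_elements_by_xpath("//div[@class='m-grid-container']//div[@class='grid-gfy-item']/a[@href]"):
        links.add(element.get_attribute("href"))

    # Scroll the window itself; scrollIntoView on an element may never fire
    # a lazy loader that listens for window scroll events.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE)

    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # the page stopped growing, so assume everything is loaded
    last_height = new_height

print(len(links), "links collected")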

0 Answers:

No answers