Pinterest: getting data from a frequently updated array of elements

Time: 2019-04-10 18:38:00

Tags: python selenium-webdriver pinterest web-mining

On an album page (https://pinterest.com/user/album/) you can see the elements arranged on the page. I am using the Firefox WebDriver with the default window size.

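All elements have the same class (class="YlMIw Hb7"), and they load about 20 at a time. What I am doing is iterating over the elements: for each one I open its view page, get the data, go back to the album page, and so on.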

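When the iterator reaches the last element of the array, the driver scrolls down to that element and the iterator starts again from the first element. This approach has two problems:

  1. Some elements are skipped when the iterator is reset.
  2. It is very likely to raise "IndexError: list index out of range", because the array size changes as the driver moves through the page and new elements load.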
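
To make the two problems concrete, here is a minimal sketch that runs without Selenium. It assumes, hypothetically, that the page keeps only a sliding window of pins in the DOM and that each scroll shifts that window. The names `index_based_walk`, `stale_index_fails`, `size` and `shift` are invented for this illustration and are not part of Pinterest or Selenium.

```python
def index_based_walk(all_pins, size, shift):
    """Walk a sliding window by index, scrolling and resetting to 0 at its end."""
    visited = []
    start = 0  # position of the simulated DOM window inside the album
    i = 0
    while start < len(all_pins):
        elems = all_pins[start:start + size]  # what a re-query would return
        if i == len(elems):
            start += shift  # simulated scroll: the window moves down
            i = 0           # the iterator is reset
            continue
        visited.append(elems[i])
        i += 1
    return visited


def stale_index_fails(elems, i):
    """Return True if index i, kept from an earlier query, is out of range."""
    try:
        elems[i]
    except IndexError:
        return True
    return False


pins = list(range(10))
visited = index_based_walk(pins, size=4, shift=6)
skipped = sorted(set(pins) - set(visited))
print(skipped)                         # pins that were never visited
print(stale_index_fails(pins[:3], 3))  # index kept from a longer list
```

In this toy setup the window moves further than its own length, so the pins in between are never visited. If the window moved less than its length, the same loop would visit some pins twice instead.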

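I have commented almost every line, but if you have any questions, please ask. I would be glad to receive suggestions, or even a new strategy, for reaching the main goal: getting all the data from the page. Thanks in advance!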

Using Python 3.7 with Selenium WebDriver.

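import time
import urllib.request

from selenium.common.exceptions import NoSuchElementException

# `driver` is assumed to be an already-created Firefox WebDriver instance.

def getData():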
    i = 0
    j = 0   
    while True:
        time.sleep(1.5)

        elems = driver.find_elements_by_class_name("Yl-") # Load array of elements in page

        # Check whether we have reached the end of the current array
        if len(elems) == i:
            # Old way, go to the bottom of the page:
            # driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            driver.execute_script("arguments[0].scrollIntoView();", elems[-1]) # New way: go to the last element (index i would be out of range here)
            time.sleep(2) # Wait for new elements to load
            elems = driver.find_elements_by_class_name("Yl-") # Reload array
            i = 0 # Reset

        elems[i].click() # Go to element view page
        time.sleep(0.75)

        try:
            elems = driver.find_elements_by_class_name("hCL") # Array with some items, the image we want is the third one
        except NoSuchElementException:
            print("...")        

        imgsrc = elems[2].get_attribute('src') # Image source link
        imgsrc2 = imgsrc.replace("564x", "originals", 1) # Get better quality image

        try:
            urllib.request.urlretrieve(imgsrc2, "files/file" + str(j) + ".jpg")
        except Exception:
            # If the better quality image isn't available, get the normal one
            urllib.request.urlretrieve(imgsrc, "files/file" + str(j) + ".jpg")

        try:
            elems = driver.find_elements_by_class_name("gUZ") # Array with some items, the back button is the third from last
        except NoSuchElementException:
            print("Can't find back button...")

        elems = driver.find_elements_by_class_name("gUZ")
        elems[-3].click() # Go back to album page
        i = i + 1
        j = j + 1
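
One possible direction, sketched here without Selenium so that it can be run on its own: instead of indexing into the array, collect a unique key for every element seen while scrolling and stop once several scrolls in a row bring nothing new. Everything below is an assumption made for illustration. `collect_unique`, `FakeAlbum` and `max_idle_rounds` are hypothetical names, and the fake album only imitates a page that shows a sliding window of pins.

```python
from typing import Callable, Hashable, Iterable, List


def collect_unique(get_visible: Callable[[], Iterable[Hashable]],
                   scroll_down: Callable[[], None],
                   max_idle_rounds: int = 3) -> List[Hashable]:
    """Collect every distinct key seen while scrolling, in first-seen order."""
    seen = set()
    ordered = []
    idle = 0  # consecutive rounds that produced no new key
    while idle < max_idle_rounds:
        new_found = False
        for key in get_visible():
            if key not in seen:
                seen.add(key)
                ordered.append(key)
                new_found = True
        idle = 0 if new_found else idle + 1
        scroll_down()
    return ordered


class FakeAlbum:
    """Hypothetical stand-in for the album page: a sliding window of pins."""

    def __init__(self, total, window, step):
        self.pins = ["pin-%d" % n for n in range(total)]
        self.window = window
        self.step = step
        self.top = 0

    def visible(self):
        return self.pins[self.top:self.top + self.window]

    def scroll(self):
        # Never scroll past the position where the last window is shown.
        last_top = max(len(self.pins) - self.window, 0)
        self.top = min(self.top + self.step, last_top)


album = FakeAlbum(total=50, window=20, step=15)
keys = collect_unique(album.visible, album.scroll)
print(len(keys))
```

With a real driver, `get_visible` might return something stable per element, for example the value of `get_attribute('href')` on each link, and the collected keys could then be visited one by one after scrolling has finished. Whether the album page exposes such a link on every element is an assumption I have not verified. The sketch also relies on each scroll overlapping the previous window; the set removes the resulting duplicates.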

0 answers:

No answers