Scraping Instagram post links returns an empty array

Time: 2020-07-30 18:42:23

Tags: python arrays selenium web-scraping instagram

I am following this guide: https://medium.com/swlh/tutorial-web-scraping-instagrams-most-precious-resource-corgis-235bf0389b0c

I have used it successfully in the past, but for some reason it now returns empty arrays, as shown below, instead of a list of permalinks:

C:\Users\19053\InstagramPublicImageDownloader\venv\Scripts\python.exe C:/Users/19053/InstagramPublicImageDownloader/getpermalinks.py
[]
[]
[]
[]
[]
[]
[]
[]

The output should look like this:

['https://www.instagram.com/p/CDRbCxjBakW/','https://www.instagram.com/p/CDMQ9J2Fvl4/','...and so on']

The code is as follows:

import time

from selenium.webdriver import Chrome

url = "https://www.instagram.com/dairyqueen/"
browser = Chrome()
browser.get(url)
post = 'https://www.instagram.com/p/'
post_links = []
while len(post_links) < 25:
    links = [a.get_attribute('href') for a in browser.find_elements_by_tag_name('a')]
    for link in links:
        if post in link and link not in post_links:
            post_links.append(link)
            scroll_down = "window.scrollTo(0, document.body.scrollHeight);"
            browser.execute_script(scroll_down)
            time.sleep(10)
        else:
            print(post_links[:25])

1 answer:

Answer 0 (score: 0)

To collect the URLs you want, use the CSS selector div.v1Nh3.kIKUG._bz0w > a, and use WebDriverWait instead of time.sleep(...).

Put the scroll-to-bottom step inside the loop and repeat it until you have the number of elements you expect.
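The scroll-and-repeat idea can be sketched without a browser, which makes the loop easy to reason about. This is only an illustration: `collect_until`, `FakePage`, and the `max_rounds` cap are hypothetical names that do not appear in the question or in Selenium, and the fake page stands in for a real driver.

```python
from typing import Callable, List


def collect_until(fetch: Callable[[], List[str]],
                  scroll: Callable[[], None],
                  target: int = 25,
                  max_rounds: int = 20) -> List[str]:
    """Fetch, then scroll, until `target` items are seen or rounds run out."""
    items: List[str] = []
    for _ in range(max_rounds):
        items = fetch()
        if len(items) >= target:
            break
        scroll()  # load more content before the next fetch
    return items[:target]


class FakePage:
    """Stand-in for a browser page: each scroll reveals 10 more fake links."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.visible = 10

    def fetch(self) -> List[str]:
        count = min(self.visible, self.total)
        return [f"/p/fake{i}/" for i in range(count)]

    def scroll(self) -> None:
        self.visible += 10


page = FakePage(total=40)
links = collect_until(page.fetch, page.scroll, target=25)
print(len(links))
```

With Selenium, `fetch` would presumably wrap the element lookup and `scroll` would wrap `browser.execute_script(scroll_down)`. The round cap is an added assumption so the loop still ends if the page never yields enough links.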

Try the following code:

browser.get('https://www.instagram.com/dairyqueen/')

scroll_down = "window.scrollTo(0, document.body.scrollHeight);"

# Keep scrolling until at least 25 post links are visible.
while True:
    # Wait up to 20 seconds for the post anchors to become visible.
    links = WebDriverWait(browser, 20).until(
        EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'div.v1Nh3.kIKUG._bz0w > a'))
    )
    if len(links) >= 25:
        break
    browser.execute_script(scroll_down)

# Read the permalink from each anchor element.
post_links = [link.get_attribute('href') for link in links]

print(post_links[:25])

Imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
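If you would rather keep the prefix filter from the question, separating it from the browser code makes it easier to check. The sketch below is one possible shape, not part of the original answer: `extract_post_links` and `POST_PREFIX` are hypothetical names, and it uses `startswith` where the question used a substring test. It also skips `None`, which `get_attribute('href')` returns for anchors that have no `href`.

```python
from typing import Iterable, List, Optional

POST_PREFIX = "https://www.instagram.com/p/"


def extract_post_links(hrefs: Iterable[Optional[str]],
                       prefix: str = POST_PREFIX,
                       limit: int = 25) -> List[str]:
    """Return up to `limit` unique hrefs starting with `prefix`, in page order."""
    seen = set()
    result: List[str] = []
    for href in hrefs:
        if len(result) >= limit:
            break
        # Anchors without an href attribute yield None; skip them.
        if href is None or not href.startswith(prefix):
            continue
        if href not in seen:
            seen.add(href)
            result.append(href)
    return result


# Sample hrefs taken from the question, plus a None and a duplicate.
sample = [
    None,
    "https://www.instagram.com/dairyqueen/",
    "https://www.instagram.com/p/CDRbCxjBakW/",
    "https://www.instagram.com/p/CDRbCxjBakW/",
    "https://www.instagram.com/p/CDMQ9J2Fvl4/",
]
print(extract_post_links(sample))
```

In the scraper, the input would be the list of `href` values read from the anchor elements on each pass.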