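I am following this tutorial: https://medium.com/swlh/tutorial-web-scraping-instagrams-most-precious-resource-corgis-235bf0389b0c
It worked for me in the past, but for some reason it now prints empty lists, as shown below, instead of the list of permalinks: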
C:\Users\19053\InstagramPublicImageDownloader\venv\Scripts\python.exe C:/Users/19053/InstagramPublicImageDownloader/getpermalinks.py
[]
[]
[]
[]
[]
[]
[]
[]
The output should look like this:
['https://www.instagram.com/p/CDRbCxjBakW/','https://www.instagram.com/p/CDMQ9J2Fvl4/','...and so on']
Here is the code:
from selenium.webdriver import Chrome
import time  # needed for time.sleep() below

url = "https://www.instagram.com/dairyqueen/"
browser = Chrome()
browser.get(url)
post = 'https://www.instagram.com/p/'
post_links = []
while len(post_links) < 25:
    # Read the href of every <a> element currently on the page
    links = [a.get_attribute('href') for a in browser.find_elements_by_tag_name('a')]
    for link in links:
        # Keep only post permalinks that have not been collected yet
        if post in link and link not in post_links:
            post_links.append(link)
    # Scroll to the bottom so that more posts load
    scroll_down = "window.scrollTo(0, document.body.scrollHeight);"
    browser.execute_script(scroll_down)
    time.sleep(10)
else:
    print(post_links[:25])
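Answer 0 (score: 0)
To collect the URLs you want, use the CSS selector div.v1Nh3.kIKUG._bz0w > a, and use WebDriverWait instead of time.sleep(...).
Put the scroll at the bottom of the loop body and repeat until you have the number of elements you expect.
Try the following code: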
# Assumes `browser` is the Chrome() instance created in the question
browser.get('https://www.instagram.com/dairyqueen/')
scroll_down = "window.scrollTo(0, document.body.scrollHeight);"
while True:
    # Wait up to 20 seconds for the post anchors to become visible
    links = WebDriverWait(browser, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'div.v1Nh3.kIKUG._bz0w > a')))
    if len(links) < 25:
        # Not enough posts loaded yet: scroll down and check again
        browser.execute_script(scroll_down)
    else:
        break
post_links = []
for link in links:
    post_links.append(link.get_attribute('href'))
print(post_links[:25])
With the following imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
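As a rough sketch of the "scroll and repeat until you have enough elements" idea, the loop can be separated from the browser so it is easy to check on its own. This is not the tutorial's code: the function name collect_post_links and the parameters read_hrefs, scroll, limit and max_rounds are hypothetical names chosen for illustration. It filters by the post URL prefix from the question, skips anchors with no href, and stops after a fixed number of rounds so that a page returning no post links cannot loop forever.

```python
from typing import Callable, Iterable, List, Optional

# Prefix used in the question to recognise post permalinks
POST_PREFIX = "https://www.instagram.com/p/"


def collect_post_links(
    read_hrefs: Callable[[], Iterable[Optional[str]]],
    scroll: Callable[[], None],
    limit: int = 25,
    max_rounds: int = 10,
) -> List[str]:
    """Collect up to `limit` unique post links, giving up after `max_rounds` reads."""
    post_links: List[str] = []
    for _ in range(max_rounds):
        for href in read_hrefs():
            # An anchor without an href yields None, so guard before matching
            if href and href.startswith(POST_PREFIX) and href not in post_links:
                post_links.append(href)
        if len(post_links) >= limit:
            break
        scroll()  # ask the page to load more posts before the next read
    return post_links[:limit]


if __name__ == "__main__":
    # Stand-in data instead of a real browser: each scroll reveals the next "page"
    fake_pages = [
        ["https://www.instagram.com/p/AAA/", None],
        ["https://www.instagram.com/p/AAA/", "https://www.instagram.com/p/BBB/"],
    ]
    position = {"index": 0}

    def fake_read() -> List[Optional[str]]:
        return fake_pages[min(position["index"], len(fake_pages) - 1)]

    def fake_scroll() -> None:
        position["index"] += 1

    print(collect_post_links(fake_read, fake_scroll, limit=2))
```

With Selenium, read_hrefs could be something like lambda: [a.get_attribute('href') for a in browser.find_elements(By.TAG_NAME, 'a')] and scroll could be lambda: browser.execute_script(scroll_down). If the result is still an empty list after max_rounds, the page is probably not exposing any post links to the browser session, which is worth checking in the browser window before tuning selectors.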