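I'm trying to scrape Instagram hashtags with Selenium and bs4 (no, I'm not using the API), but I keep getting this error:
"Element is not currently interactable and may not be manipulated"
I've tried waiting for the page to load, but whatever I do, I get either an empty print statement or that error. I searched and only found outdated answers, so I finally decided to ask here.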
def scrape(browser):
    browser.get("https://www.instagram.com/instagram/")
    tag = input("Enter a hashtag you would like to search: ")
    # ig search bar
    search = browser.find_element_by_css_selector('._9x5sw')
    if tag != '#':
        search.send_keys('#' + tag)
    else:
        search.send_keys(tag)
    # scrape IG hash tags
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    time.sleep(5)
    for soup in soup.find_all('a', {'class': '_k2vj6'}):
        print(soup)
Answer 0 (score: 1)
I was able to get it working with this (using Firefox and PhantomJS):
from selenium import webdriver
from bs4 import BeautifulSoup
import time

if __name__ == '__main__':
    tag = input("Enter a hashtag you would like to search: ")
    url = 'https://www.instagram.com/instagram/'
    driver = webdriver.PhantomJS('<yourPathToPhantomJS>')
    driver.set_window_size(1124, 850)
    # driver = webdriver.Firefox()
    driver.get(url)
    # Take the first <input> on the page as the search bar
    search = driver.find_elements_by_tag_name('input')
    # Click first so the field has focus before typing
    search[0].click()
    # Add the leading '#' only if the user did not type one
    if not tag.startswith('#'):
        search[0].send_keys('#' + tag)
    else:
        search[0].send_keys(tag)
    # Give the search results time to render
    time.sleep(10)
    html = driver.page_source
    # Name the parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a', {'class': '_k2vj6'})
    for link in links:
        print(link)
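A fixed `time.sleep` either waits too long or not long enough, and the "not currently interactable" error is typically raised when keys are sent before the field is visible and enabled. A minimal sketch of an explicit wait, assuming (as the code above does) that the search box is the first `<input>` on the page. `normalize_hashtag` and `type_hashtag` are hypothetical helper names, not part of the original answer, and the 10-second timeout is an arbitrary choice:

```python
def normalize_hashtag(tag):
    """Return the tag with exactly one leading '#' and no surrounding spaces."""
    return '#' + tag.strip().lstrip('#')


def type_hashtag(driver, tag, timeout=10):
    """Wait until the first <input> is clickable, then type the hashtag into it."""
    # Selenium is imported here so normalize_hashtag can be used without it.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Assumption: the search box is the first <input> on the page.
    # until() raises TimeoutException if it never becomes clickable.
    search = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.TAG_NAME, 'input'))
    )
    search.click()
    search.send_keys(normalize_hashtag(tag))
    return search
```

With a driver that has already loaded the profile page, `type_hashtag(driver, tag)` would stand in for the click and `send_keys` lines; the results still need their own wait before `page_source` is read.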
Two nits:
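Answer 1 (score: 1)
How about this? It gives you the live DOM with the JavaScript already loaded, and should save you some searching. The idea is to grab the whole body; if you like, you can replace body with html as well, and the result will be exactly what Selenium sees.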
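from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
dri = webdriver.Chrome(options=options)
dri.get('https://www.instagram.com/instagram/')  # load the page before reading the DOM
# Read the rendered body (after JavaScript has run) and hand it to BeautifulSoup
html = dri.find_element_by_tag_name("body").get_attribute('innerHTML')
soup = BeautifulSoup(html, features="lxml")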
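Once the rendered HTML is in hand, the hashtag links still have to be picked out of it. Class names such as `_k2vj6` look generated and may change between site versions, so one alternative is to filter links by their `href`. A rough sketch, assuming hashtag pages live under `/explore/tags/` (an assumption about Instagram's URL layout, not something stated in this thread). `hashtag_from_href` and `extract_hashtags` are hypothetical names:

```python
from urllib.parse import urlparse

TAG_PREFIX = '/explore/tags/'  # assumption: hashtag pages live under this path


def hashtag_from_href(href):
    """Return the hashtag named by a link, or None if it is not a tag link."""
    path = urlparse(href).path
    if not path.startswith(TAG_PREFIX):
        return None
    name = path[len(TAG_PREFIX):].strip('/')
    return name or None


def extract_hashtags(html):
    """Collect hashtag names from every <a href=...> in an HTML string."""
    # bs4 is imported here so hashtag_from_href works without it installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tags = []
    for link in soup.find_all('a', href=True):
        name = hashtag_from_href(link['href'])
        if name:
            tags.append(name)
    return tags
```

Passing the `html` string from the snippet above to `extract_hashtags(html)` would return the tag names found in the page, in document order.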