I want to scrape the anchor links with class="_1UoZlX" from the search results on this specific page - https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4io
When I create the soup from the page, I realize that the search results are rendered using React JS, so I can't find them in the page source (or in the soup).
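For reference, here is the minimal check (plain requests, no Javascript execution) showing that the links aren't in the raw HTML - it prints an empty list:

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML; no Javascript runs, so React never renders the results
page = requests.get('https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4io')
soup = BeautifulSoup(page.text, 'html.parser')

# Empty list: the anchors are injected client-side after the initial load
print soup.findAll('a', {'class': '_1UoZlX'})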
Here is my code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
listUrls = ['https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4iof']
PHANTOMJS_PATH = './phantomjs'
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
urls=[]
for url in listUrls:
    browser.get(url)
    wait = WebDriverWait(browser, 20)
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "_1UoZlX")))
    soup = BeautifulSoup(browser.page_source, "html.parser")
    results = soup.findAll('a', {'class': "_1UoZlX"})
    for result in results:
        link = result["href"]
        print link
        urls.append(link)
print urls
Here is the error I'm getting:
Traceback (most recent call last):
  File "fetch_urls.py", line 19, in <module>
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "_1UoZlX")))
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Screenshot: available via screen
Someone mentioned in this answer that there is a way to use selenium to process the javascript on a page. Can someone elaborate on that? I did some googling but couldn't find an approach that works for this particular case.
Answer 0 (score: 2)
There is nothing wrong with your code; the problem is the site you are scraping - for some reason it never stops loading, and that blocks the page from being parsed and the rest of your code from running.
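If the load itself is what never finishes, one workaround is to cap the page-load time and parse whatever has rendered by then. A rough sketch (the 30-second cap is an arbitrary choice, and it assumes the results render before the cap fires):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

browser = webdriver.Chrome("./chromedriver")
# Stop waiting for the page load after 30 seconds instead of hanging forever
browser.set_page_load_timeout(30)

try:
    browser.get('https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4iof')
except TimeoutException:
    # The load timed out, but anything rendered so far is still in the DOM
    pass

soup = BeautifulSoup(browser.page_source, "html.parser")
print soup.findAll('a', {'class': '_1UoZlX'})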
I tried to confirm the same thing with Wikipedia:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
listUrls = ["https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"]
# browser = webdriver.PhantomJS('/usr/local/bin/phantomjs')
browser = webdriver.Chrome("./chromedriver")
urls=[]
for url in listUrls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    results = soup.findAll('a', {'class': "mw-redirect"})
    for result in results:
        link = result["href"]
        urls.append(link)
print urls
Output:
[u'/wiki/List_of_states_and_territories_of_India_by_area', u'/wiki/List_of_Indian_states_by_GDP_per_capita', u'/wiki/Constitutional_republic', u'/wiki/States_and_territories_of_India', u'/wiki/National_Capital_Territory_of_Delhi', u'/wiki/States_Reorganisation_Act', u'/wiki/High_Courts_of_India', u'/wiki/Delhi_NCT', u'/wiki/Bengaluru', u'/wiki/Madras', u'/wiki/Andhra_Pradesh_Capital_City', u'/wiki/States_and_territories_of_India', u'/wiki/Jammu_(city)']
P.S. I'm using the Chrome driver so that the script runs against a real Chrome browser for debugging. Download the Chrome driver from https://chromedriver.storage.googleapis.com/index.html?path=2.27/

Answer 1 (score: 0)
Selenium will render pages that include Javascript. Your code is working correctly - it is waiting for the element to be generated. In your case, Selenium isn't finding that CSS element. The URL you provided doesn't render a results page; instead, it generates the following error page.
This page doesn't have that CSS class. Your code is waiting for that specific CSS element. Try the Firefox webdriver and see what happens.
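For example, a quick sketch along those lines (assuming geckodriver is installed for Firefox) - dumping a screenshot shows what the URL actually rendered:

from selenium import webdriver

# Use Firefox instead of PhantomJS to rule out a driver-specific issue
browser = webdriver.Firefox()
browser.get('https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4iof')

# The screenshot reveals what actually rendered - here, the error page
browser.save_screenshot('rendered.png')
print browser.title
browser.quit()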