How do I use Selenium with Firefox to scrape a website?
echo "deb http://packages.linuxmint.com debian import" >> /etc/apt/sources.list && apt-get update
apt-get install firefox xvfb python-dev python-pip
pip install pyvirtualdisplay selenium
from pyvirtualdisplay import Display
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
display = Display(visible=0, size=(800, 600))
display.start()
def init_driver():
    driver = webdriver.Firefox()
    driver.wait = WebDriverWait(driver, 5)
    return driver

def lookup(driver, query):
    driver.get("http://www.google.com")
    try:
    box = driver.wait.until(EC.presence_of_element_located(
        (By.NAME, "q")))
    button = driver.wait.until(EC.element_to_be_clickable(
        (By.NAME, "btnK")))
    box.send_keys(query)
    button.click()
    except TimeoutException:
        print("Box or Button not found in google.com")

if __name__ == "__main__":
    driver = init_driver()
    lookup(driver, "Selenium")
    time.sleep(5)
    driver.quit()
    display.stop()
  File "selenium_scrape.py", line 20
    box = driver.wait.until(EC.presence_of_element_located(
      ^
IndentationError: expected an indented block
Answer 0: (score: 4)
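The difference is that you can't just use the packaged Chrome browser; you also need a special driver, chromedriver.
Get the latest version here: Chromedriver
You now have two options: either move the downloaded chromedriver to a location where it is always accessible (option 1), or define in your script how to find it (option 2).
For option 1, move it so that it can be found when you call webdriver.Chrome():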
sudo mv /path/to/download/chromedriver /usr/bin
Also make it executable:
sudo chmod a+x /usr/bin/chromedriver
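Or, for option 2, you can define the path in your script:
import os
from selenium import webdriver

# Path to the downloaded chromedriver binary
chromedriver = "/Users/you/Downloads/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)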
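Before choosing between the two options, it can help to check whether a chromedriver executable is already reachable. The following is a minimal sketch, assuming Python 3 and only the standard library; the helper name find_chromedriver and its parameters are hypothetical and not part of Selenium.

```python
import os
import shutil


def find_chromedriver(search_path=None, fallback=None):
    """Return the path of an executable chromedriver, or None.

    search_path: PATH-style string to search (None means the real PATH).
    fallback: explicit file path to try if the search finds nothing.
    Both parameters are illustrative, not a Selenium API.
    """
    # Option 1: the binary was moved somewhere on PATH (e.g. /usr/bin).
    found = shutil.which("chromedriver", path=search_path)
    if found:
        return found
    # Option 2: an explicit location, which must exist and be executable.
    if fallback and os.path.isfile(fallback) and os.access(fallback, os.X_OK):
        return fallback
    return None
```

If find_chromedriver(fallback="/Users/you/Downloads/chromedriver") returns None, the file is neither on PATH nor executable at the fallback location, which points back to the mv and chmod steps above.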
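Answer 1: (score: 2)
(Note: the original question was about Chrome, so my answer is about Chrome, not Firefox.)
For me, it works if I simply extract chromedriver into the same folder as the script.
Then I run it like this:
Xvfb :99 -ac -screen 0 1280x1024x16 &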
echo 'Starting the test'
PATH=$PATH:. python selenium_scrape.py
This starts Xvfb and adds the current directory, which contains chromedriver, to the PATH.
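The same launch sequence can also be driven from Python instead of a shell script. This is a minimal sketch, assuming Xvfb is installed and the script is named selenium_scrape.py; xvfb_command, child_env and run_headless are hypothetical helper names, not part of Selenium or pyvirtualdisplay.

```python
import os
import subprocess
import sys


def xvfb_command(display=":99", width=1280, height=1024, depth=16):
    """Build the Xvfb command line used in the shell snippet."""
    screen = "{}x{}x{}".format(width, height, depth)
    return ["Xvfb", display, "-ac", "-screen", "0", screen]


def child_env(display=":99", extra_dir=".", base=None):
    """Copy the environment, set DISPLAY and append extra_dir to PATH."""
    env = dict(os.environ if base is None else base)
    env["DISPLAY"] = display
    parts = [p for p in (env.get("PATH", ""), extra_dir) if p]
    env["PATH"] = os.pathsep.join(parts)
    return env


def run_headless(script="selenium_scrape.py"):
    """Start Xvfb, run the script under it, then stop Xvfb.

    Like the shell version, this does not wait for Xvfb to be ready.
    """
    xvfb = subprocess.Popen(xvfb_command())
    try:
        return subprocess.call([sys.executable, script], env=child_env())
    finally:
        xvfb.terminate()
        xvfb.wait()
```

Passing DISPLAY through the child environment has the same effect as setting os.environ['DISPLAY'] at the top of the script.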
A modified version of your script works for me:
import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
# comment this out to run on the real display
os.environ['DISPLAY'] = ':99'
def init_driver():
    driver = webdriver.Chrome()
    driver.wait = WebDriverWait(driver, 5)
    return driver

def lookup(driver, query):
    driver.get("http://www.google.com")
    try:
        box = driver.wait.until(EC.presence_of_element_located(
            (By.NAME, "q")))
        # once we type the query, this button disappears
        # button = driver.wait.until(EC.element_to_be_clickable(
        #     (By.NAME, "btnK")))
        box.send_keys(query)
        button = driver.wait.until(EC.element_to_be_clickable(
            (By.NAME, "btnG")))
        button.click()
    except TimeoutException:
        print("Box or Button not found in google.com")

if __name__ == "__main__":
    driver = init_driver()
    lookup(driver, "Selenium")
    time.sleep(5)
    driver.quit()
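The driver.wait.until(...) calls above are explicit waits: the condition is polled until it returns something truthy or the timeout expires. The sketch below is a simplified stand-in for that behavior using only the standard library; wait_until and WaitTimeout are hypothetical names and this is not Selenium's actual implementation.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical stand-in for selenium's TimeoutException."""


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() repeatedly until it is truthy or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Return whatever the condition produced (e.g. the element).
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1fs" % timeout)
        time.sleep(poll)
```

This is why the order of operations matters in lookup: a wait for a button that has already disappeared can only end in a timeout.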
Answer 2: (score: 0)
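The question is (currently) about an indentation error. That is easy to fix: indent the statements inside the try block one level deeper than the try line.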
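The effect of the fix can be reproduced without a browser. This is a minimal sketch, assuming Python 3; BROKEN and FIXED are cut-down, hypothetical stand-ins for the question's lookup function, and compiles is an illustrative helper.

```python
# The body of `try:` sits at the same level as `try:` itself.
BROKEN = (
    "def lookup(query):\n"
    "    try:\n"
    "    box = query.strip()\n"
    "    except AttributeError:\n"
    "        box = None\n"
    "    return box\n"
)

# The body of `try:` is indented one level deeper.
FIXED = (
    "def lookup(query):\n"
    "    try:\n"
    "        box = query.strip()\n"
    "    except AttributeError:\n"
    "        box = None\n"
    "    return box\n"
)


def compiles(source):
    """Return True if source parses, False on an IndentationError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError:
        return False
```

Here compiles(BROKEN) is False and compiles(FIXED) is True; the only difference between the two strings is the indentation of the line after try.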