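I run this web scraper on my laptop. It uses Firefox (via Selenium WebDriver) to fetch the data, and it has to actually open Firefox because the data is generated by JavaScript. I would like to know whether a dedicated server can open Firefox and fetch the data. I assume a dedicated server has no display, so it won't work? The full script is much longer (a good 152 lines); I have pasted only the part I think won't work. I am confident that importing the data into PostgreSQL will be no problem on a dedicated server.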
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import lxml
import re
import psycopg2
import sys

driver = webdriver.Firefox()
driver.set_window_position(-9999, -9999)  # move the window off-screen
driver.get("http://rodos.vsb.cz/Road.aspx?road=D2")
time.sleep(20)  # wait until the page has loaded
html_source = driver.page_source
soup = BeautifulSoup(html_source, 'lxml')

list1 = []  # defined elsewhere in the full script
# find tags with speed information (km/h)
for i in soup.find_all("tspan", {"id": re.compile(r"tspan_Label_\w*")}):
    if re.match(r"^[0-9]+$", str(i.getText())) is not None:
        if str(i.parent.get('fill')) == '#5f5f5f':
            list1.append(i.getText())
Answer 0: (score: 1)
I think what you are looking for is pyvirtualdisplay:
pip install pyvirtualdisplay
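pyvirtualdisplay starts a virtual display in memory (it is a Python wrapper around Xvfb and similar X servers), so Firefox can run on a machine with no physical screen and without showing a window. Xvfb itself has to be installed on the server.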
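Since pyvirtualdisplay relies on an X server binary such as Xvfb being present, it may be worth checking for it on the server before running the scraper. The following is a minimal sketch using only the standard library; the helper name `xvfb_available` is made up for illustration and is not part of pyvirtualdisplay.

```python
import shutil


def xvfb_available(binary="Xvfb"):
    """Return True if an executable with the given name is found on PATH.

    `xvfb_available` is a hypothetical helper for this example only.
    """
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if xvfb_available():
        print("Xvfb found; a virtual display can be started.")
    else:
        # On Debian/Ubuntu the package is called xvfb
        print("Xvfb not found; install it with the system package manager.")
```

If the check fails, installing the distribution's Xvfb package should be enough for `Display(...).start()` to work.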
import re
import time

from bs4 import BeautifulSoup
from pyvirtualdisplay import Display
from selenium import webdriver

# Start a virtual display; 1366 x 768 matches most 15" laptop screens
display = Display(visible=0, size=(1366, 768))
display.start()

# Firefox now runs inside the virtual display
browser = webdriver.Firefox()
# Set the width and height of the browser window
browser.set_window_size(1366, 768)

# Set timeouts (in seconds) before loading the page so that they apply to it
browser.set_script_timeout(30)
browser.set_page_load_timeout(30)

# Open the URL
browser.get('http://rodos.vsb.cz/Road.aspx?road=D2')
time.sleep(20)  # wait until the JavaScript has rendered the data

html_source = browser.page_source
soup = BeautifulSoup(html_source, 'lxml')

# find tags with speed information (km/h)
list1 = []
for i in soup.find_all("tspan", {"id": re.compile(r"tspan_Label_\w*")}):
    if re.match(r"^[0-9]+$", str(i.getText())) is not None:
        if str(i.parent.get('fill')) == '#5f5f5f':
            list1.append(i.getText())

# Shut down the browser and the virtual display
browser.quit()
display.stop()
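The filtering step can also be tried on the server without launching Firefox at all, which helps separate display problems from parsing problems. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so it runs with no extra packages. The class `SpeedLabelParser`, the function `extract_speeds`, and the `SAMPLE` markup are all assumptions made for illustration; the real page structure may differ, and nested `tspan` elements are not handled.

```python
import re
from html.parser import HTMLParser

LABEL_ID = re.compile(r"tspan_Label_\w*")
NUMERIC = re.compile(r"^[0-9]+$")


class SpeedLabelParser(HTMLParser):
    """Collect numeric <tspan> labels whose parent element has a given fill."""

    def __init__(self, fill="#5f5f5f"):
        super().__init__()
        self.fill = fill
        self.speeds = []
        self._stack = []     # (tag, attrs) of the currently open elements
        self._buffer = None  # text of a matching <tspan> being read, else None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent_fill = self._stack[-1][1].get("fill") if self._stack else None
        if (tag == "tspan"
                and LABEL_ID.search(attrs.get("id") or "")
                and parent_fill == self.fill):
            self._buffer = ""
        self._stack.append((tag, attrs))

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer += data

    def handle_endtag(self, tag):
        if tag == "tspan" and self._buffer is not None:
            if NUMERIC.match(self._buffer):
                self.speeds.append(self._buffer)
            self._buffer = None
        # Pop the stack back to the most recent matching open tag
        for index in range(len(self._stack) - 1, -1, -1):
            if self._stack[index][0] == tag:
                del self._stack[index:]
                break


def extract_speeds(html, fill="#5f5f5f"):
    """Return the numeric label texts found in the given markup."""
    parser = SpeedLabelParser(fill)
    parser.feed(html)
    parser.close()
    return parser.speeds


# Made-up markup that only imitates the kind of SVG the loop above expects
SAMPLE = """
<svg>
  <text fill="#5f5f5f"><tspan id="tspan_Label_1">87</tspan></text>
  <text fill="#5f5f5f"><tspan id="tspan_Label_2">D2</tspan></text>
  <text fill="#ff0000"><tspan id="tspan_Label_3">42</tspan></text>
  <text fill="#5f5f5f"><tspan id="other_4">130</tspan></text>
</svg>
"""

if __name__ == "__main__":
    print(extract_speeds(SAMPLE))
```

Only the first label passes all three checks (matching id, numeric text, grey parent fill), so the sample yields a single value. Saving `browser.page_source` to a file once and feeding it to `extract_speeds` would be one way to compare results between the laptop and the server.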