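I have some code that returns the page titles for a list of URLs. Because I have to wait for each URL to finish loading before I can read its title, I'd like to know whether I can load several URLs at the same time and get two titles back at once.
Here is the code: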
from pyvirtualdisplay import Display
from time import sleep
import sys
# Python 2 only: force UTF-8 as the default string encoding
reload(sys)
sys.setdefaultencoding('utf-8')
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.firefox.options import Options

display = Display(visible=0, size=(800, 600))
display.start()
urlsFile = open("urls.txt", "r")
urls = urlsFile.readlines()
driver = webdriver.Firefox(executable_path='/usr/local/lib/geckodriver/geckodriver')
driver.set_page_load_timeout(60)
for url in urls:
    try:
        driver.get(url)
        sleep(0.8)
        print(driver.title)
    except TimeoutException as e:
        print("Timeout")
If I try this instead:
driver = webdriver.Firefox(executable_path='/usr/local/lib/geckodriver/geckodriver')
driver2 = webdriver.Firefox(executable_path='/usr/local/lib/geckodriver/geckodriver')
for url in urls:
    try:
        driver.get(url)
        driver2.get(url)
        sleep(0.8)
        print(driver.title)
        print(driver2.title)
    except TimeoutException as e:
        print("Timeout")
driver2 loads the same URL as driver. Is it possible to have driver2 load the next URL in the list instead, so that both pages load at the same time and no time is wasted?
Answer 0 (score: 1)
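from multiprocessing.pool import Pool

from selenium import webdriver

# a function to process a single URL
def my_url_function(url):
    # each process uses its own driver
    driver = webdriver.Firefox(executable_path='/usr/local/lib/geckodriver/geckodriver')
    try:
        driver.get(url)
        print("Got {}".format(url))
        # return the title so that pool.map() can collect it
        return driver.title
    finally:
        # always close the browser, even if loading failed
        driver.quit()

if __name__ == "__main__":
    # read URLs into list `urls`, skipping blank lines
    with open("urls.txt", "r") as urlsFile:
        urls = [line.strip() for line in urlsFile if line.strip()]
    # a multiprocessing pool with 2 worker processes
    pool = Pool(processes=2)
    map_results_list = pool.map(my_url_function, urls)
    pool.close()
    pool.join()
    print(map_results_list)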
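This example uses Python's multiprocessing module to process 2 URLs truly at the same time, although you can of course change the number of processes when you set up the pool.
The pool.map() function takes a function and a list, iterates over the list, and sends each item to the function, running the calls in the pool's worker processes. It returns the results as a list in the same order as the input.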
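To see how map() hands out the items and collects the results, here is a minimal sketch that runs without a browser. It makes two assumptions that are not in the answer: fake_title() is a hypothetical stand-in for "load the page and read driver.title", and ThreadPool (the thread-backed class from the same module, with the same map() interface) replaces Pool so the snippet is easy to run anywhere.

```python
from multiprocessing.pool import ThreadPool


def fake_title(url):
    # Hypothetical stand-in for loading the page and reading driver.title.
    return "Title of " + url.strip()


def titles_for(urls, workers=2):
    # map() sends each URL to fake_title and returns the results
    # in the same order as the input list.
    with ThreadPool(processes=workers) as pool:
        return pool.map(fake_title, urls)


# Lines read with readlines() keep their trailing newline, hence strip().
example_urls = [
    "http://example.com/a\n",
    "http://example.com/b\n",
    "http://example.com/c\n",
]
titles = titles_for(example_urls)
print(titles)
```

Changing workers is the counterpart of changing processes=2 in the answer's pool.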
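Change the my_url_function() function to do whatever you need, but don't share resources between the functions running in the pool: let each call create its own driver and anything else it needs. It is possible to share some things between concurrent functions, but sharing nothing at all is safer.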
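The "share nothing" advice could look roughly like the sketch below. FakeDriver is a hypothetical stand-in for webdriver.Firefox (it is not part of Selenium), and ThreadPool is used instead of Pool only so that the snippet runs without a browser. The point is the shape of the worker function: it builds its own driver, uses it, and closes it, so no driver object is ever shared between concurrent calls.

```python
from multiprocessing.pool import ThreadPool


class FakeDriver:
    """Hypothetical stand-in for webdriver.Firefox; not a Selenium class."""

    def __init__(self):
        self.title = None
        self.closed = False

    def get(self, url):
        # Pretend to load the page and set its title.
        self.title = "Title of " + url

    def quit(self):
        self.closed = True


def title_with_own_driver(url):
    # The driver is created inside the call, so nothing is shared.
    driver = FakeDriver()
    try:
        driver.get(url)
        return driver.title
    finally:
        # Close the driver even if get() raised.
        driver.quit()


with ThreadPool(processes=2) as pool:
    own_driver_titles = pool.map(
        title_with_own_driver,
        ["http://example.com/a", "http://example.com/b"],
    )
print(own_driver_titles)
```

Creating a driver per call is simple but costs a browser start for every URL, so with a real browser you may prefer one long-lived driver per worker.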