I've run into a problem: I need to run Selenium searches for more than 2000 products on a website. The main issue so far is that I can't create and launch 2000 webdriver instances, or I'd end up exhausting all my RAM. On top of that, I also get a MaxRetryError because of too many retries against the same website over the same session. Is it possible, with multiprocessing, to have at most N webdrivers open at the same time?
I was thinking of something like this:
import multiprocessing
from multiprocessing import Pool

from selenium import webdriver


class ExecutorWeb1():
    def __init__(self, df):
        self.drivers = []
        self.n_drivers = multiprocessing.cpu_count() - 1
        for n in range(0, self.n_drivers):
            self.drivers.append(webdriver.Firefox())
        self.df = df

    def exec(self, data):
        print(data[0][data[1]])
        product = data[0][data[1]]
        worker_n = multiprocessing.current_process()._identity[0]
        if product is not None:
            # self.drivers is an array of Firefox webdrivers of 'cpu_count() - 1' length
            product_scraping = ProductScrapingWeb1(self.drivers[worker_n - 1], product)
            models = product_scraping.main()
            return (models, data[1])

    def parallelize(self):
        pool = Pool(self.n_drivers)
        rows = [row for i, row in self.df.iterrows()]
        ids = range(0, len(rows))
        # data is a list of pairs where each pair has a row of my dataframe and its index
        data = list(zip(rows, ids))
        pool_results = pool.map(self.exec, data)
        pool.close()
        pool.join()
        # Do smt
        return  # smt
But I get:
Traceback (most recent call last):
File "scraping-main.py", line 104, in <module>
ex.parallelize()
File "/home/giulio/Desktop/scraping/ExecutorWeb1.py", line 34, in parallelize
pool_results = pool.map(self.exec, data)
File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/usr/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object
This happens because webdriver instances cannot be pickled, so they cannot be passed to the child processes during multiprocessing.
Can anyone think of a workable solution?
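One common pattern for this situation is to create the driver *inside* each worker process via the Pool's initializer, so no driver object ever has to cross the process boundary (and only `n_drivers` ever exist at once). The sketch below is hedged: a plain dict stands in for `webdriver.Firefox()` so it runs without Selenium installed, and `_init_worker`, `_scrape`, and `run` are illustrative names, not part of the code above.

```python
import multiprocessing

# One "driver" per worker process, created in the initializer below.
# In the real scraper this module-level global would hold a
# webdriver.Firefox() instance instead of a dict.
_driver = None


def _init_worker():
    # Runs once in each worker process when the Pool starts it.
    global _driver
    _driver = {"pid": multiprocessing.current_process().pid}


def _scrape(item):
    # _driver already lives in this process, so nothing is pickled;
    # this is where ProductScrapingWeb1(_driver, item).main() would go.
    return (item * 2, _driver["pid"])


def run(items, n_workers=3):
    # At most n_workers drivers exist simultaneously, regardless of
    # how many items (e.g. 2000 products) are processed.
    with multiprocessing.Pool(n_workers, initializer=_init_worker) as pool:
        return pool.map(_scrape, items)


if __name__ == "__main__":
    print(run([1, 2, 3, 4]))
```

The key point is that `pool.map` is given a module-level function and plain picklable items; the unpicklable resource is reconstructed per process rather than serialized. (For a real Firefox driver you would also want to close the drivers, e.g. by having each worker register `_driver.quit` with `atexit`.)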