I'm trying to use Selenium to get the number of results for particular searches on a website. Basically, I want to make the process run faster. My code works by iterating over the search terms and then over the newspapers, and it outputs the collected data to a CSV. Currently it produces 9 CSVs (3 search terms x 3 newspapers over 3 years) at roughly 10 minutes per CSV.
I would like to use multiprocessing to run each search-and-newspaper combination simultaneously, or at least faster. I've tried to follow other examples on here but haven't been able to implement them successfully. Here is my code so far:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
import pandas as pd
from multiprocessing import Pool
def websitesearch(search):
    try:
        start = list_of_inputs[0]
        end = list_of_inputs[1]
        newsabbv=list_of_inputs[2]
        directory=list_of_inputs[3]
        os.chdir(directory)

        if search == broad:
            specification = "broad"
            relPapers = newsabbv
        elif search == narrow:
            specification = "narrow"
            relPapers = newsabbv
        elif search == general:
            specification = "allarticles"
            relPapers = newsabbv
        else:
            for newspapers in relPapers:
                ...rest of code here that gets the data and puts it in a list named all_Data...

                browser.close()
                df = pd.DataFrame(all_Data)
                df.to_csv(filename, index=False)

    except:
        print('error with item')
if __name__ == '__main__':
    ...Initializing values and things like that go here. This helps with the setup for search...

    #These are things that go into the function
    start = ["January",2015]
    end = ["August",2017]
    directory = "STUFF GOES HERE"
    newsabbv = all_news_abbv
    search_list = [narrow, broad, general]

    list_of_inputs = [start,end,newsabbv,directory]

    pool = Pool(processes=4)

    for search in search_list:
        pool.map(websitesearch, search_list)
        print(list_of_inputs)
If I add a print statement into the main() function it will print, but nothing actually happens beyond that. I'd appreciate any help. I've left out the code that collects the values and puts them into lists because it's complicated, but I know it works.
Thanks in advance for any help! Let me know if there is more information I can provide.
Isaac
EDIT: I've looked around online for more help and realized that I had misunderstood the purpose of mapping a list onto a function with pool.map(fn, list). I've updated my code to reflect my current approach, which still isn't working. I've also moved the initializing values into the main function.
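For reference, a minimal sketch of the usual pool.map pattern: one call over the whole list, with each worker receiving a single item. The search terms below are placeholder strings, not the actual values from the code above:
from multiprocessing import Pool

def websitesearch(search):
    # each worker process receives one item from search_list
    print('processing', search)

if __name__ == '__main__':
    search_list = ["narrow", "broad", "general"]  # placeholder search terms
    with Pool(processes=4) as pool:
        pool.map(websitesearch, search_list)  # a single call maps the whole list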
Answer 0 (score: 0)
I don't think this can be multiprocessed the way you want, because the work is still queued up by Selenium itself (this has nothing to do with the queue module).
The reason is that Selenium can only handle one window; it cannot handle several windows or tabbed browsers at the same time (a limitation of window_handle). That means your multiple processes only parallelize the in-memory handling of the data that is sent to, or scraped by, Selenium. Trying to do the Selenium crawling inside one script file makes Selenium the source of the bottleneck.
The best way to achieve real multiprocessing is to move the Selenium crawl into a standalone script (such as the crawler.py used below) that takes its target URL as a command-line argument.
For example:
# import all the modules that you need to run selenium
import sys

url = sys.argv[1]  # you will catch the url passed on the command line
driver = ......  # open the browser
driver.get(url)
# just continue the script based on your method
print(--the result that you want--)
sys.exit(0)
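Each launched copy of this script drives its own browser, so running several copies at once is what gives the real parallelism; as the comments further down note, the invocation is along the lines of python crawler.py "some-url", with the URL picked up through sys.argv[1].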
I could give more explanation, but this is the main core of the process, and only you know exactly what you want to do on that website.
a. Design the URLs. Multiprocessing means creating several processes and running them across all CPU cores; that is the best way to implement it. The first step is to determine the input to the process, which in your case is probably the target URLs (you did not tell us the target website you want to crawl). Every page of the site has a different URL, so just collect all the URLs and divide them into several groups (best practice: your number of CPU cores - 1), as in the sketch after the example below.
For example:
import multiprocessing as mp
cpucore = int(mp.cpu_count()) - 1
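A minimal sketch of that grouping step, assuming a hypothetical all_urls list that has already been collected from the target site:
import multiprocessing as mp

cpucore = max(int(mp.cpu_count()) - 1, 1)

# hypothetical list of target urls collected beforehand
all_urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]

# split the urls into (at most) cpucore groups, one group per worker process
url_groups = [all_urls[i::cpucore] for i in range(cpucore)]
url_groups = [group for group in url_groups if group]  # drop empty groups when there are few urls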
b. Send the URLs to be processed by the crawler.py you made earlier (via subprocess, or another module such as os.system). Make sure you run at most cpucore copies of crawler.py at once.
For example:
import subprocess
from subprocess import PIPE

crawler = r'YOUR FILE DIRECTORY\crawler.py'

def devideurl():
    global url1, url2, url3, url4
    # make a script that results in:
    # url1 = group or list of urls
    # url2 = group or list of urls
    # url3 = group or list of urls
    # url4 = group or list of urls
    pass  # placeholder for the actual splitting logic

def target1():
    for url in url1:
        t1 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script based on your needs...
        # do you see the combination between python crawler and url?
        # the cmd command will be: python crawler.py "value", and the "value" is captured by sys.argv[1] in crawler.py

def target2():
    for url in url2:
        t2 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script based on your needs...

def target3():
    for url in url3:
        t3 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script based on your needs...

def target4():
    for url in url4:
        t4 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script based on your needs...
cpucore = int(mp.cpu_count()) - 1
pool = mp.Pool(processes=cpucore)  # max is the value of cpucore

devideurl()  # build url1..url4 before dispatching the workers
for target in (target1, target2, target3, target4):
    pool.apply_async(target)  # you can add more targets, depending on your cpu cores
pool.close()
pool.join()
c. Save the printed results back into the main script's memory (a sketch of this capture step follows at the end of this answer).
d. Continue your script's processing on the data you have already collected.
With this method you can open many browser windows and process them at the same time, and since crawling the website is slower than processing the data in memory, it at least reduces the bottleneck in the data flow. That means it is faster than your previous approach.
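As a minimal sketch of step c, assuming the crawler.py script and the crawler path from the examples above, a hypothetical helper like the one below captures whatever each crawler.py process prints and keeps it in the main script's memory:
import subprocess
from subprocess import PIPE

crawler = r'YOUR FILE DIRECTORY\crawler.py'  # same path as in the example above

def crawl_and_capture(urls):
    # hypothetical helper: launch crawler.py once per url and keep whatever it prints
    results = []
    for url in urls:
        proc = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        out, _ = proc.communicate()           # wait for the crawl to finish and read its stdout
        results.append(out.decode().strip())  # the printed result, now in this script's memory
    return results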
Hope it helps... cheers.