Question

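As of today I have a script that reads a data table on a site and loops over it, clicking an entry in each row to get additional data, then goes back and repeats. The pseudocode is as follows: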

from selenium import webdriver

browser = webdriver.Chrome()

# FuncNode loops through each row and collects the node identifier as text.
# This way I don't lose the reference after clicking and going back,
# since the DOM changes.
node_list = FuncNode(browser)

Now I have the list:

for track_id in node_list:

    node = Search_for_node_in_main_page(track_id)  # Now I have the row in a node

    # Get some data

    button = Get_row_button(node)

    button.click()

    # Now I change the focus onto the new tab, do some scraping, and write all
    # data to a MySQL database

    # Close the new tab and focus my browser back on the main tab

    # End of the loop; repeat until the last item on the list is scraped

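This usually takes a while, so I was wondering how to optimize it with multiprocessing. From what I have read, the closest approach is, once I have the list, to create a Pool, wrap all the code in a function, and then apply the Pool to the list and that function: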

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(4) as p:
        records = p.map(cool_function, node_list)

        p.terminate()
        p.join()

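My problem is that I am working with a browser here, so I guess I would have to open a different browser for each process. If so, how can I reuse them? Mainly because the page makes heavy use of JavaScript, and depending on the page, loading it takes some time, more than it takes to scrape the 4-5 rows.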

Also, assuming this can work somehow, would writing to MySQL from different processes at the same time cause any problems?
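One way to sidestep the concurrent-write question entirely is to let the workers only return rows and have the parent process do all the inserts over a single connection. The sketch below illustrates that layout under loud assumptions: it uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs anywhere, and the `records` table, its columns and `save_records` are hypothetical. A MySQL driver would use its own connection call and placeholder style.

```python
import sqlite3
from multiprocessing import Pool


def cool_function(track_id):
    """Placeholder for the scraping step; returns a row instead of writing it."""
    return (track_id, f"details for {track_id}")


def save_records(conn, records):
    """Insert all rows in one transaction, from a single process."""
    with conn:  # commits on success, rolls back on an exception
        conn.executemany(
            "INSERT INTO records (track_id, details) VALUES (?, ?)", records
        )


if __name__ == "__main__":
    # sqlite3 in memory is an assumption standing in for the MySQL database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (track_id TEXT, details TEXT)")
    with Pool(4) as p:
        # imap_unordered yields each result as soon as a worker finishes it.
        rows = list(p.imap_unordered(cool_function, ["a", "b", "c"]))
    save_records(conn, rows)
    print(conn.execute("SELECT COUNT(*) FROM records").fetchone()[0])
```

The trade-off is that nothing is written until the scraped rows reach the parent. If each worker should write on its own instead, each process would need its own connection rather than sharing one.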

So, in short, how can I use multiprocessing here to optimize my initial script?

Multiprocessing in a website scraping script

0 answers: