ThreadPoolExecutor - how to return arguments

Time: 2016-02-01 12:40:02

Tags: python pandas threadpoolexecutor

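I need to parse about 1000 URLs. So far I have a function that returns a pandas dataframe after parsing a URL. How should I best structure the program so that all the dataframes are combined into one? I am also unsure how to get the return values back from the futures. In the example below, how can I eventually merge all the temporary dataframes into a single dataframe (i.e. finalDF = finalDF.append(temp))?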

import concurrent.futures
import pandas as pd

def Parser(ptf):
    temp=pd.DataFrame()
    URL="http://"+str(ptf)
    #..some complex operations, including a requests.get(URL) which returns eventually a temp: a pandas dataframe
    return temp #returns a pandas dataframe

def conc_caller(ptf):
    temp=Parser(ptf)

    #this won't work because finalDF is not defined, unclear how to handle this
    finalDF= finalDF.append(temp)
    return finalDF

booklist=['a','b','c']
finalDF=pd.DataFrame()        
executor = concurrent.futures.ProcessPoolExecutor(3)
futures = [executor.submit(conc_caller, item) for item in booklist]
concurrent.futures.wait(futures)

Another problem is that I get this error message:

 An attempt has been made to start a new process before the
 current process has finished its bootstrapping phase.

Any suggestions on how to fix the code are appreciated.

1 Answer:

Answer 0: (score: 1)

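You have to protect the launch code with if __name__ == '__main__': to prevent processes from being created forever. Put the guard before the code that creates the executor and submits the tasks, so that everything through concurrent.futures.wait(futures) runs inside it.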
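
As a minimal sketch of how the guard and the result collection might fit together: the names parse_one and collect_all are hypothetical, and parse_one is only a stand-in for Parser that returns a list of row dicts instead of fetching a URL. The idea is that each worker returns its own result, and the main process reads it back with future.result() rather than appending to a shared finalDF inside the worker.

```python
import concurrent.futures


def parse_one(item):
    """Hypothetical stand-in for Parser: returns the rows parsed for one item."""
    # Assumption: a real version would fetch "http://" + str(item)
    # and build a pandas dataframe from the response.
    return [{"source": item, "length": len(str(item))}]


def collect_all(items, executor):
    """Submit one task per item and gather everything the workers return."""
    futures = [executor.submit(parse_one, item) for item in items]
    rows = []
    for future in concurrent.futures.as_completed(futures):
        # result() hands back the worker's return value,
        # or re-raises the exception the worker hit.
        rows.extend(future.result())
    return rows


if __name__ == "__main__":
    # Only the main process runs this block; worker processes that
    # re-import the module skip it, so no extra processes are started.
    booklist = ["a", "b", "c"]
    with concurrent.futures.ProcessPoolExecutor(max_workers=3) as executor:
        rows = collect_all(booklist, executor)
    print(sorted(rows, key=lambda row: row["source"]))
```

If the workers return dataframes instead, the same pattern should apply: collect each future.result() in a list and combine them once at the end with pd.concat(frames, ignore_index=True). That is also the route to take on pandas 2.0 and later, where DataFrame.append has been removed.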