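I need to parse about 1000 URLs. So far I have a function that parses one URL and returns a pandas DataFrame. How should I best structure the program so that all the DataFrames end up combined? I am also unsure how to get return values back from the futures. In the example below, how can I finally merge all the temporary DataFrames into a single one (i.e. finalDF = finalDF.append(temp))?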

    import concurrent.futures
    import pandas as pd

    def Parser(ptf):
        temp = pd.DataFrame()
        URL = "http://" + str(ptf)
        # ... some complex operations, including a requests.get(URL),
        # which eventually produce temp: a pandas DataFrame
        return temp  # returns a pandas DataFrame

    def conc_caller(ptf):
        temp = Parser(ptf)
        # this won't work because finalDF is not defined here; unclear how to handle this
        finalDF = finalDF.append(temp)
        return finalDF

    booklist = ['a', 'b', 'c']
    finalDF = pd.DataFrame()
    executor = concurrent.futures.ProcessPoolExecutor(3)
    futures = [executor.submit(conc_caller, item) for item in booklist]
    concurrent.futures.wait(futures)

Another problem is that I get this error message:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
Any suggestions on how to fix the code are appreciated.
Answer 0 (score: 1)
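You have to guard the startup code with if __name__ == '__main__': to keep new processes from being spawned endlessly. Then, right after concurrent.futures.wait(futures), read each worker's return value from its future and combine the returned frames.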
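
A minimal sketch of how this could fit together, not necessarily what the asker's real code needs. The names parse_one and parse_all are hypothetical, and the body of parse_one is a placeholder for the real requests.get based parser. The worker only returns its own DataFrame; the parent process reads it back with result() and combines everything once with pd.concat.

```python
import concurrent.futures

import pandas as pd


def parse_one(item):
    """Hypothetical stand-in for the real parser: one DataFrame per item."""
    url = "http://" + str(item)
    # The real code would call requests.get(url) here and build the frame
    # from the response; this placeholder only records its inputs.
    return pd.DataFrame({"item": [item], "url": [url]})


def parse_all(items, max_workers=3):
    """Run parse_one in worker processes and combine the returned frames."""
    with concurrent.futures.ProcessPoolExecutor(max_workers) as executor:
        futures = [executor.submit(parse_one, item) for item in items]
        concurrent.futures.wait(futures)
        # result() hands back whatever the worker function returned
        # (or re-raises the exception it failed with).
        frames = [future.result() for future in futures]
    # Concatenate once at the end; items must not be empty.
    return pd.concat(frames, ignore_index=True)


if __name__ == "__main__":
    # Guarded so that child processes importing this module do not
    # start process pools of their own.
    booklist = ["a", "b", "c"]
    final_df = parse_all(booklist)
    print(final_df)
```

Because the futures list is built in the same order as booklist, the rows of final_df follow the input order. pd.concat is used instead of finalDF.append because DataFrame.append was removed in pandas 2.0, and concatenating once is also cheaper than appending in a loop.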