pandas apply in threads returns a list of df column names instead of the computed values

Time: 2018-11-07 16:05:43

Tags: python-3.x multithreading pandas getattr

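I have a pandas dataframe, urls_df, whose columns hold components that are combined to form a URL. There are basically two ways to combine these components and retrieve the content at the URL. URLObject takes care of this part and returns the web content and the server content from the resulting URL as a dict. It works like this: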

list(df.apply(lambda row: getattr(URLObject(row[discriminant_col],
                                            row['col_1'],
                                            row['col_2']),
                                  attribute),
              axis=1))  # returns list of dicts
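To make the getattr call above concrete, here is a minimal sketch of the lookup for a single row. The URLObject class shown is a hypothetical stand-in (the real one is not included in the question), the way it joins the URL parts is invented for illustration, and the attribute names web_data and server_data are assumed from the variable names used later in the question.

```python
class URLObject:
    """Hypothetical stand-in for the real URLObject; it performs no network I/O."""

    def __init__(self, discriminant, part_1, part_2):
        # Assumed URL layout, for illustration only.
        self.url = f"{part_1}/{discriminant}/{part_2}"

    @property
    def web_data(self):
        return {"source": "web", "url": self.url}

    @property
    def server_data(self):
        return {"source": "server", "url": self.url}


# One row, written as a plain dict with the same keys as the dataframe columns.
row = {
    "URL_DISCRIMINANT_COLUMN_1": "folder_a",
    "col_1": "http://example.com",
    "col_2": "index.html",
}

obj = URLObject(row["URL_DISCRIMINANT_COLUMN_1"], row["col_1"], row["col_2"])

# getattr takes the attribute name as a string, so the attribute to read
# can be chosen at run time.
attribute = "server_data"
result = getattr(obj, attribute)
```

Note that getattr needs the attribute name as a str; passing anything else raises a TypeError.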

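urls_df contains two discriminant columns, ['URL_DISCRIMINANT_COLUMN_1', 'URL_DISCRIMINANT_COLUMN_2'], which alter the URL slightly to reach a specific folder. For each row I have to run the getattr(URLObject... call above four times: once per folder for each source, with two folders on the web and two on the server. The challenge is running these in parallel, as shown below. I am using 4 threads, one per column. The problem is that, apparently because of threading, the data sometimes comes back correctly and sometimes comes back as the names of the urls_df columns instead of their contents. Can someone explain what I am doing wrong and what I can do to fix it?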

import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool


def do_threaded(tasks):
    # Run each (function, args) task in its own thread and collect the results in task order.
    with ThreadPool(len(tasks)) as pool:
        results = [pool.apply_async(*t) for t in tasks]
        results = [res.get() for res in results]
    return results


def func(df, discriminant_col, attribute):
    # Build one URLObject per row and read the requested attribute from it.
    return list(df.apply(lambda row: getattr(URLObject(row[discriminant_col],
                                                        row['col_1'],
                                                        row['col_2']),
                                             attribute),
                         axis=1))


urls_df = pd.DataFrame()  # this is a pandas dataframe with data
# server_data and web_data hold the names of the URLObject attributes to read
tasks = [(func, (urls_df, discriminant, atr))
         for atr in [server_data, web_data]
         for discriminant in ['URL_DISCRIMINANT_COLUMN_1', 'URL_DISCRIMINANT_COLUMN_2']]

table = do_threaded(tasks)
headers = [header.pop(0) for header in table]
content_df = pd.DataFrame(table)
content_df.columns = headers
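For comparison, here is a sketch of one way to structure the same four tasks so that the threads never touch the dataframe. It is not a diagnosis of the problem above. It only relies on one fact: calling list() on a DataFrame yields its column labels, so any code path in which apply hands back a DataFrame rather than a Series would produce column names. The sketch sidesteps that by working on plain row dicts (which urls_df.to_dict('records') would produce) and by keying each result on its (attribute, column) pair. URLObject is again a hypothetical stub, and fetch_column and run_tasks are illustrative names, not part of the original code.

```python
from concurrent.futures import ThreadPoolExecutor


class URLObject:
    """Hypothetical stub; the real class would fetch content over the network."""

    def __init__(self, discriminant, part_1, part_2):
        self.url = f"{part_1}/{discriminant}/{part_2}"  # assumed layout

    @property
    def web_data(self):
        return {"source": "web", "url": self.url}

    @property
    def server_data(self):
        return {"source": "server", "url": self.url}


def fetch_column(records, discriminant_col, attribute):
    # Explicit loop over plain dicts: the return value is always a list
    # with exactly one entry per row.
    return [
        getattr(URLObject(r[discriminant_col], r["col_1"], r["col_2"]), attribute)
        for r in records
    ]


def run_tasks(records, discriminant_cols, attributes):
    # One task per (attribute, discriminant column) pair.
    keys = [(attr, col) for attr in attributes for col in discriminant_cols]
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        futures = {
            key: pool.submit(fetch_column, records, key[1], key[0]) for key in keys
        }
        # Results stay attached to their key, so ordering cannot mix them up.
        return {key: future.result() for key, future in futures.items()}


# With pandas, records would come from urls_df.to_dict("records").
records = [
    {"URL_DISCRIMINANT_COLUMN_1": "a1", "URL_DISCRIMINANT_COLUMN_2": "b1",
     "col_1": "http://example.com", "col_2": "x.html"},
    {"URL_DISCRIMINANT_COLUMN_1": "a2", "URL_DISCRIMINANT_COLUMN_2": "b2",
     "col_1": "http://example.com", "col_2": "y.html"},
]

results = run_tasks(
    records,
    ["URL_DISCRIMINANT_COLUMN_1", "URL_DISCRIMINANT_COLUMN_2"],
    ["server_data", "web_data"],
)
```

The resulting dict maps each (attribute, column) pair to a list of dicts, and pd.DataFrame(results) would turn it into one column per pair.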

0 answers:

No answers yet