Iterating over hundreds of thousands of CSV files with pandas

Time: 2018-02-15 13:29:18

Tags: python multithreading python-3.x pandas multiprocessing

I am currently using concurrent.futures.ProcessPoolExecutor to iterate over a large number of CSV files, like so:

def readcsv(file):
    df = pd.read_csv(file, delimiter="\s+", names=[headers], comment="#")
    #DOING SOME OTHER STUFF TO IT 
    full.append(df) 

if __name__ == "__main__":
    full = []
    files = "glob2 path to files" 
    with concurrent.futures.ProcessPoolExecutor(max_workers=45) as proc:
        proc.map(readcsv,files)
    full = pd.concat(full)

This currently does not work, because it raises a ValueError: "No objects to concatenate" on the last line. How can I iterate over the files, append them to a list, and then concatenate them, or just get them into a dataframe directly as fast as possible? Available resources are 64 GB of RAM and 46 cores in a VM.

1 Answer:

Answer 0 (score: 1)

The map function actually returns an iterable containing the results of the function calls. Each worker runs in a separate process, so appending to a list there never reaches the parent; instead, you just need to return the df:

import concurrent.futures
import pandas as pd

def readcsv(file):
    # use a raw string so "\s+" is passed to the regex engine unmangled
    df = pd.read_csv(file, delimiter=r"\s+", names=headers, comment="#")
    #DOING SOME OTHER STUFF TO IT 
    return df

if __name__ == "__main__":
    files = "glob2 path to files" 
    with concurrent.futures.ProcessPoolExecutor(max_workers=45) as proc:
        full = pd.concat(proc.map(readcsv, files))