What is the fastest implementation for reading a large pandas DataFrame in parallel?

Asked: 2017-06-23 20:47:48

Tags: python performance pandas dataframe parallel-processing

I have been working with increasingly large datasets. I like Python and pandas and don't want to give up these tools, but one of my DataFrames now takes 12 minutes to load. I would like to speed this up, and using multiple processors seems like the most promising approach.

What is the fastest implementation for reading a (possibly compressed) tab-separated file into a DataFrame? I intended to use Dask, but I could not get it to work.

I could not adapt the Dask approach from this question, because its example splits by a fixed sample size rather than by rows and I don't know how to generalize it: read process and concatenate pandas dataframe in parallel with dask
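For reference, this is roughly the Dask variant I experimented with (a minimal sketch, not my exact code; the blocksize and column handling are placeholders):

import dask.dataframe as dd

def read_df_dask(path, blocksize="64MB"):
    # dd.read_csv splits an uncompressed text file into ~blocksize partitions
    # and parses them in parallel. Note: gzip streams cannot be split, so for
    # .gz files blocksize must be None and the file is read in a single task.
    if path.endswith(".gz"):
        blocksize = None
    ddf = dd.read_csv(path, sep="\t", blocksize=blocksize)
    # .compute() materializes the partitions and concatenates them into one pandas DataFrame
    return ddf.compute()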

I tried the following, adapted from this post, to build a faster TSV reader: http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html

import multiprocessing
import subprocess
import numpy as np
import pandas as pd

def count_lines(path):
    # Count lines with `wc -l` (note: for a gzipped file this counts lines of the
    # compressed stream, not of the decompressed data)
    return int(subprocess.check_output('wc -l {}'.format(path), shell=True).split()[0])

def _process_frame(df):
    return df

def read_df_parallel(path, index_col=0, header=0, compression="infer", engine="c", n_jobs=-1):
    # Compression
    if compression == "infer":
        if path.endswith(".gz"):
            compression = "gzip"
    # Parallel
    if n_jobs == -1:
        n_jobs = multiprocessing.cpu_count()
    if n_jobs == 1:
        df = pd.read_table(path, sep="\t", index_col=np.arange(index_col+1), header=header, compression=compression, engine=engine)
    else:
        # Set up workers
        pool = multiprocessing.Pool(n_jobs)
        num_lines = count_lines(path)
        chunksize = num_lines // n_jobs
        reader = pd.read_table(path, sep="\t", index_col=np.arange(index_col+1), header=header, compression=compression, engine=engine, chunksize=chunksize, iterator=True)
        # Send each chunk to a worker and collect the results
        df_list = list()
        for chunk in reader:
            df_tmp = pool.apply_async(_process_frame, [chunk])
            df_list.append(df_tmp)
        df = pd.concat(f.get() for f in df_list)
        pool.close()
        pool.join()
    return df

Why is the parallel version slower?

And what is the fastest implementation for reading a large gzipped (or uncompressed) table into a pandas DataFrame?

%%time
path = "./Data/counts/gt2500.counts.tsv.gz"
%timeit read_df_parallel(path, n_jobs=1)
%timeit read_df_parallel(path, n_jobs=-1)

5.62 s ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.81 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CPU times: user 1min 30s, sys: 8.66 s, total: 1min 38s
Wall time: 1min 39s

0 Answers