I have been working with larger and larger datasets. I like Python and pandas and don't want to give up these tools, but one of my DataFrames takes 12 minutes to load. I'd like to speed that up, and using multiple processors seems like the best way to do it.
What is the fastest implementation for reading a tab-delimited file that may be compressed? I planned to use Dask, but I couldn't get it to work.
I couldn't get the Dask method from this question working, because the sample there isn't split into evenly sized rows (and I don't know how to generalize it): read process and concatenate pandas dataframe in parallel with dask
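For reference, this is roughly the Dask route I was attempting (a minimal sketch rather than my exact code; it assumes dask.dataframe is installed, and blocksize=None is needed because gzip is not a splittable format, so Dask ends up reading the compressed file as a single partition anyway):

import dask.dataframe as dd

# Sketch of the Dask approach (assumed usage, same file as benchmarked below).
# gzip cannot be split, so blocksize=None tells Dask to read the file as one partition.
ddf = dd.read_csv("./Data/counts/gt2500.counts.tsv.gz", sep="\t",
                  compression="gzip", blocksize=None)
df = ddf.compute()  # materialize as a regular pandas DataFrame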
I tried the following approach, based on http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html, to build a faster TSV reader:
import multiprocessing
import subprocess

import numpy as np
import pandas as pd


def count_lines(path):
    # Count lines with `wc -l`; note this counts the file as-is,
    # so for a .gz file it is not the number of decompressed rows.
    return int(subprocess.check_output("wc -l {}".format(path), shell=True).split()[0])


def _process_frame(df):
    # Placeholder worker: each chunk is pickled, sent to a worker process,
    # returned unchanged, and pickled back to the parent.
    return df


def read_df_parallel(path, index_col=0, header=0, compression="infer", engine="c", n_jobs=-1):
    # Compression
    if compression == "infer":
        if path.endswith(".gz"):
            compression = "gzip"
    # Parallel
    if n_jobs == -1:
        n_jobs = multiprocessing.cpu_count()
    if n_jobs == 1:
        df = pd.read_table(path, sep="\t", index_col=np.arange(index_col + 1), header=header,
                           compression=compression, engine=engine)
    else:
        # Set up workers
        pool = multiprocessing.Pool(n_jobs)
        num_lines = count_lines(path)
        chunksize = num_lines // n_jobs
        reader = pd.read_table(path, sep="\t", index_col=np.arange(index_col + 1), header=header,
                               compression=compression, engine=engine, chunksize=chunksize, iterator=True)
        # Iterate through the chunks, handing each one to a worker
        df_list = list()
        for chunk in reader:
            df_tmp = pool.apply_async(_process_frame, [chunk])
            df_list.append(df_tmp)
        df = pd.concat(f.get() for f in df_list)
        pool.close()
        pool.join()
    return df
Why is the parallel version slower?
What is the fastest implementation for reading a large gzipped (or uncompressed) table into a pandas DataFrame?
%%time
path = "./Data/counts/gt2500.counts.tsv.gz"
%timeit read_df_parallel(path, n_jobs=1)
%timeit read_df_parallel(path, n_jobs=-1)

Output:

5.62 s ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    # n_jobs=1
6.81 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    # n_jobs=-1
CPU times: user 1min 30s, sys: 8.66 s, total: 1min 38s
Wall time: 1min 39s