1. Dask only

import dask.dataframe as dd
from multiprocessing import cpu_count

#Count the number of cores
cores = cpu_count()

# Read the dataframes and repartition them by the number of cores
english = dd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.en',
               sep='\r', header=None, names=['ingles'], dtype={'ingles':str})
english = english.repartition(npartitions=cores)
spanish = dd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.es',
              sep='\r', header=None, names=['espanol'], dtype={'espanol':str})
spanish = spanish.repartition(npartitions=cores)

#compute
%time total_dd = dd.merge(english, spanish, left_index=True, right_index=True).compute()

Out: 9.77 s

2. Pandas + Dask

import pandas as pd
import dask.dataframe as dd
from multiprocessing import cpu_count

#Count the number of cores
cores = cpu_count()

# Read the dataframes with pandas, then partition them by the number of cores
pd_english = pd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.en',
                      sep='\r', header=None, names=['ingles'])

pd_spanish = pd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.es',
                      sep='\r', header=None, names=['espanol'])
english_pd = dd.from_pandas(pd_english, npartitions=cores)
spanish_pd = dd.from_pandas(pd_spanish, npartitions=cores)

#compute
%time total_pd = dd.merge(english_pd, spanish_pd, left_index=True, right_index=True).compute()

Out: 1.31 s

Does anyone know why? Is there another way to do this faster?

Answer 1

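Note that:

dd.read_csv(...) does not actually load the data. It only builds a step of the computation graph.
When you run compute, the whole computation graph built so far is actually executed, including the reading of both DataFrames.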

So in the first variant, the timed operation includes:

reading both DataFrames,
repartitioning them,
and finally the merge itself.

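In the second variant, the situation is different as far as timing is concerned. Both DataFrames have already been read beforehand, so the timed operation includes only the partitioning and the merge.

Evidently the source DataFrames are large and reading them takes considerable time, which is not accounted for in the second variant.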

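Try another test: create a function that:

reads both DataFrames with pd.read_csv(...),
performs the remaining steps (partitioning and merge).

Then measure the execution time of this function.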

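I suppose the execution time may be even longer than in the first variant, because:

in the first variant both DataFrames are read simultaneously (by different cores),
in the second variant they are read sequentially.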

Why does Dask compute a DataFrame faster using from_pandas than when reading it directly with Dask?
