Question

我有一些相当大的csv文件（~10gb），并希望利用dask进行分析。但是，根据我设置要读入的dask对象的分区数量，我的groupby结果会发生变化。我的理解是，dask利用了分区的核心外处理优势，但它仍然会返回适当的groupby输出。这似乎并非如此，我正在努力找出必要的替代设置。以下是一个小例子：

df = pd.DataFrame({'A': np.arange(100), 'B': np.random.randn(100), 'C': np.random.randn(100), 'Grp1': np.repeat([1, 2], 50), 'Grp2': [3, 4, 5, 6], 25)})

test_dd1 = dd.from_pandas(df, npartitions=1)
test_dd2 = dd.from_pandas(df, npartitions=2)
test_dd5 = dd.from_pandas(df, npartitions=5)
test_dd10 = dd.from_pandas(df, npartitions=10)
test_dd100 = dd.from_pandas(df, npartitions=100)

def test_func(x):
    x['New_Col'] = len(x[x['B'] > 0.]) / len(x['B'])
    return x

test_dd1.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A               B               C Grp1 Grp2 New_Col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48

test_dd2.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A               B              C  Grp1 Grp2 New_Col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48

test_dd5.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A               B              C  Grp1 Grp2 New_Col
0  0 -0.561376 -1.422286     1     3     0.45
1  1 -1.107799  1.075471     1     3     0.45
2  2 -0.719420 -0.574381     1     3     0.45
3  3 -1.287547 -0.749218     1     3     0.45
4  4  0.677617 -0.908667     1     3     0.45

test_dd10.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A               B              C  Grp1 Grp2 New_Col
0  0 -0.561376 -1.422286     1     3      0.5
1  1 -1.107799  1.075471     1     3      0.5
2  2 -0.719420 -0.574381     1     3      0.5
3  3 -1.287547 -0.749218     1     3      0.5
4  4  0.677617 -0.908667     1     3      0.5

test_dd100.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head()
   A               B              C  Grp1 Grp2  New_Col
0  0 -0.561376 -1.422286     1     3        0
1  1 -1.107799  1.075471     1     3        0
2  2 -0.719420 -0.574381     1     3        0
3  3 -1.287547 -0.749218     1     3        0
4  4  0.677617 -0.908667     1     3        1

df.groupby(['Grp1', 'Grp2']).apply(test_func).head()
   A               B               C Grp1 Grp2 New_Col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48

groupby步骤是否仅在每个分区内运行而不是查看整个数据帧？在这种情况下设置npartitions = 1是微不足道的，它似乎不会影响性能，但是由于read_csv自动设置一定数量的分区，你如何设置调用以确保groupby结果是准确？

谢谢！

Answer 1

我对这个结果感到惊讶。无论分区数量如何，Groupby.apply都应返回相同的结果。如果您能提供可重现的示例，我建议您raise an issue，其中一位开发人员将会看一下。

Dask DataFrame Groupby分区

1 个答案: