Apologies in advance. I'm new to Python / Pandas so this question is probably poorly posed.
I have a dataframe with about 4 million rows and roughly 10 columns.
I want to compute the mean of the first column (say A) for each group defined by the distinct values of each of the other columns (say B, C, D, E, F, G, H, I, J). This defines about 200,000 groups.
I tried groupby, e.g.
mytest = df.groupby(['B','C','D', 'E', 'F', 'G', 'H', 'I', 'J'])
mytest.mean()
This causes Python to grab all of the memory on the computer (32GB) and crash. In Stata, I can obtain the desired result when I type:
collapse A, by(B C D E F G H I J)
which it does like a champ.
How would I go about the same operation using Pandas / Python? Any help is much appreciated.
Answer 0 (score: 1)
I just ran this:
df = pd.DataFrame((np.random.rand(4000000, 10) * 10).astype(int),
                  columns=list('ABCDEFGHIJ'))
gb = df.groupby(list('BCDEFGHIJ'))
gb.mean()
with no problem, also on a 32 GB machine. A fair amount of memory was already in use, and that didn't make much difference. My guess is that the problem is this:
"This defines about 200,000 groups."
The only thing I can think of is to restrict the groupby object to the ['A'] column, like this:
gb = df.groupby(list('BCDEFGHIJ'))['A']
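Selecting the single column before aggregating means pandas only computes and holds one result per group, rather than one per remaining column. A minimal runnable sketch of the whole pipeline, using a small synthetic frame as a stand-in for the real 4-million-row data:

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the real data: integer-coded keys in B..J
# and a value column A, as in the example above.
df = pd.DataFrame((np.random.rand(10_000, 10) * 10).astype(int),
                  columns=list('ABCDEFGHIJ'))

# Select column A *before* calling mean(), so only one aggregate per group
# is computed and kept in memory.
means = df.groupby(list('BCDEFGHIJ'))['A'].mean()

# reset_index() gives a flat table, closest to Stata's `collapse` output.
flat = means.reset_index()
```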
Otherwise, you'll have to write a different algorithm... maybe.
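One hedged sketch of such an alternative (my own suggestion, not part of the answer above): accumulate per-group sums and counts chunk by chunk, then divide at the end. Only the partial (sum, count) tables ever need to fit in memory, which helps when the full groupby result is what blows up:

```python
import numpy as np
import pandas as pd

keys = list('BCDEFGHIJ')

# Synthetic stand-in data, matching the example above.
df = pd.DataFrame((np.random.rand(10_000, 10) * 10).astype(int),
                  columns=list('ABCDEFGHIJ'))

# Two-pass / chunked mean: per chunk, reduce A to (sum, count) per group,
# then combine the partials and divide. Mathematically equivalent to
# df.groupby(keys)['A'].mean().
chunk_size = 2_500
partials = []
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    partials.append(chunk.groupby(keys)['A'].agg(['sum', 'count']))

combined = pd.concat(partials).groupby(level=keys).sum()
chunked_mean = combined['sum'] / combined['count']
```

With real 4-million-row data you would read the chunks from disk (e.g. `pd.read_csv(..., chunksize=...)`) instead of slicing an in-memory frame.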