Question

Consider the following code, which is meant to divide a column by its group mean:

import numpy as np
import pandas as pd

n = 25  # assumed value; the output shown below has 25 rows
df = pd.DataFrame({'expenditure' : np.random.choice(['foo','bar'], n),
                   'groupid' : np.random.choice(['one','two'], n),
                   'coef' : np.random.randn(n)})
df.set_index('expenditure', inplace=True)
test = df.groupby(level=0).apply(lambda x: x['coef'] / x.coef.mean())

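I like the data structure as it is before the apply, and normally I can simply write df['someNewColumn'] = df.apply(...). Strangely, this time I can't immediately merge the result back.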

test should be indexed by expenditure, the index on which the groupby happened. Instead, it has a double index:

>>> test
expenditure  expenditure
bar          bar           -0.491900
             bar           -9.332964
             bar            8.019472
             bar           -4.540905
             bar            5.627947
             bar           -0.171765
             bar            5.698813
             bar            6.476207
             bar            8.796249
             bar           -8.284087
             bar            1.426311
             bar           -1.223377
foo          foo            1.900897
             foo            7.057078
             foo            0.060856
             foo            3.850323
             foo            2.928085
             foo           -3.249857
             foo            3.176616
             foo           -1.433766
             foo            0.910017
             foo            1.395376
             foo            1.898315
             foo           -1.903462
             foo           -3.590479
Name: coef, dtype: float64

Why does it have a double index, and how do I get my normalized column?

>>> test.index
MultiIndex(levels=[[u'bar', u'foo'], [u'bar', u'foo']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
           names=[u'expenditure', u'expenditure'])

My pandas version is 0.15.0.

Answer 1

It's not obvious to me which version of pandas you are using, but your apply doesn't work for me at all.

I have had trouble grouping on an index, so I always move the index and the grouping keys into ordinary columns:

df = pd.DataFrame({'expenditure' : np.random.choice(['foo','bar'], n),
                   'groupid' : np.random.choice(['one','two'], n),
                  'coef' : np.random.randn(n)})

Then you can do:

df.groupby('expenditure').coef.apply(lambda x: x / x.mean())

or the following, which is almost exactly what you tried before:

df.groupby('expenditure').apply(lambda x: x.coef / x.coef.mean())
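
If the goal is a new column on the original frame, one option is transform, which returns a result aligned with the input rows rather than one indexed by group. The following is a minimal sketch on a small hand-made frame; the data and the column name coef_norm are made up for illustration:

```python
import pandas as pd

# Small made-up frame; 'expenditure' stays an ordinary column.
df = pd.DataFrame({
    'expenditure': ['foo', 'bar', 'foo', 'bar'],
    'coef': [1.0, 2.0, 3.0, 6.0],
})

# transform('mean') returns one group mean per original row,
# so the division lines up with df row by row.
group_mean = df.groupby('expenditure')['coef'].transform('mean')
df['coef_norm'] = df['coef'] / group_mean  # 'coef_norm' is a hypothetical name

print(df)
```

Because the result keeps the original row index, it can be assigned back without any merge step.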

Answer 2

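I'm not entirely sure whether this helps you, but since you have made the expenditure column the index, you need to group by that index before applying to get what I think you want, for example: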

import numpy as np
import pandas as pd

n = 10
df = pd.DataFrame({'expenditure' : np.random.choice(['foo','bar'], n),
                   'groupid' : np.random.choice(['one','two'], n),
                  'coef' : np.random.randn(n)})

df.set_index('expenditure', inplace=True)

# when you try to apply, you need to groupby 'expenditure' -- which is the df.index
test = df.groupby(df.index).apply(lambda x: x['coef'] /x.coef.mean())

test

expenditure
bar            expenditure
bar            2.013101
bar       ...
foo            expenditure
foo            1
Name: coef, dtype...
dtype: object

test.index
Index([u'bar', u'foo'], dtype='object')
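
As for why the question's result carried two index levels: by default, groupby(...).apply prepends the group key to the index of whatever each call returns, and here each returned piece was already indexed by expenditure. Below is a minimal sketch, assuming a reasonably recent pandas and made-up data, that passes group_keys=False to suppress the extra level:

```python
import pandas as pd

# Made-up data, indexed by 'expenditure' as in the question.
df = pd.DataFrame(
    {'coef': [1.0, 2.0, 3.0, 6.0]},
    index=pd.Index(['foo', 'bar', 'foo', 'bar'], name='expenditure'),
)

# group_keys=False asks apply not to prepend the group key to the result index.
test = df.groupby(level=0, group_keys=False).apply(
    lambda g: g['coef'] / g['coef'].mean()
)

print(test.index.nlevels)
print(test)
```

Note that the rows come back grouped by key rather than in their original order, so this form is less convenient for assigning back as a column.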

Answer 3

For the sake of illustration, let's make your DataFrame simpler:

import numpy as np
import pandas as pd
n = 10
np.random.seed(0)
df = pd.DataFrame(
    data = {
        'groupid' : np.random.choice(['one','two'], n),
        'coef' : np.arange(n)
    }, 
    index=pd.Index(np.random.choice(['foo','bar'], n), name='expenditure'),
)
df


             coef groupid
expenditure              
bar             0     one
foo             1     two
foo             2     two
bar             3     one
foo             4     two
foo             5     two
foo             6     two
foo             7     two
foo             8     two
bar             9     two

You can compute the mean of coef for each expenditure group in two different ways:

means = df['coef'].mean(level='expenditure')

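or, equivalently (this groupby form is also the one to use on recent pandas releases, where the level argument of mean is no longer available):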

means = df['coef'].groupby(level='expenditure').mean()

Both give me:

expenditure
bar            4.000000
foo            4.714286
Name: coef, dtype: float64

Then we can divide the coef column by the group means, broadcasting them across rows on the expenditure level:

test = df['coef'].div(means, level='expenditure')
test

expenditure
bar            0.000000
bar            0.750000
bar            2.250000
foo            0.212121
foo            0.424242
foo            0.848485
foo            1.060606
foo            1.272727
foo            1.484848
foo            1.696970
Name: coef, dtype: float64

The original values in the bar group were 0, 3 and 9, and the bar mean is 4, so the results are 0.0, 0.75 and 2.25.
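
If the original row order matters, the same broadcasting can be spelled out by hand. The sketch below types in the frame displayed above instead of relying on the random seed, looks up each row's group mean with reindex, and divides positionally. It is one possible approach, not the code used in this answer, and the names labels, row_means and normalized are introduced only for illustration:

```python
import numpy as np
import pandas as pd

# The frame displayed above, typed in by hand.
labels = ['bar', 'foo', 'foo', 'bar', 'foo', 'foo', 'foo', 'foo', 'foo', 'bar']
df = pd.DataFrame(
    {'coef': np.arange(10)},
    index=pd.Index(labels, name='expenditure'),
)

# One mean per expenditure group.
means = df['coef'].groupby(level='expenditure').mean()

# Repeat each group's mean once per row, in the original row order.
row_means = means.reindex(df.index).to_numpy()

# Positional division, so the result keeps df's row order and index.
normalized = df['coef'] / row_means
print(normalized[normalized.index == 'bar'].tolist())
```

Since normalized shares df's index and row order, df['coef_norm'] = normalized would attach it as a column (coef_norm again being a made-up name).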

Pandas groupby adds an index
