GroupBy with combinations of categorical variables

Time: 2015-11-25 11:43:00

Tags: python pandas group-by dataframe grouping

Suppose I have the data:

pd.DataFrame({'index': ['a','b','c','a','b','c'], 'column': [1,2,3,4,1,2]}).set_index(['index'])

which gives:

       column
index
a           1
b           2
c           3
a           4
b           1
c           2

and then take the mean of each subgroup:

df.groupby(df.index).mean()

       column
index
a         2.5
b         1.5
c         2.5

However, what I've been trying to achieve without repeatedly looping over and slicing the data is this: how do I get the mean for pairs of subgroups?

For example, the mean of a & b would be 2, as if their values had been pooled together.

The output would look something like:

       column
index
a & a     2.5
a & b     2.0
a & c     2.5
b & b     1.5
b & c     2.0
c & c     2.5

Ideally this would involve manipulating the arguments to groupby, but as it stands I've had to resort to looping and slicing, building all the combinations of subgroups at some point along the way.
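To make the looping concrete, this is roughly the kind of thing I end up writing (just a sketch for illustration; the rows dict and the "a & b" style labels are my own, not any fixed API):

import itertools
import pandas as pd

df = pd.DataFrame({'index': ['a','b','c','a','b','c'],
                   'column': [1,2,3,4,1,2]}).set_index(['index'])

rows = {}
for i, j in itertools.combinations_with_replacement(df.index.unique(), 2):
    # slice out both subgroups and take the mean of their pooled values
    rows['{} & {}'.format(i, j)] = df.loc[df.index.isin([i, j]), 'column'].mean()

print(pd.Series(rows))
# a & a    2.5
# a & b    2.0
# a & c    2.5
# b & b    1.5
# b & c    2.0
# c & c    2.5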

3 Answers:

Answer 0 (score: 1)

Coming back to this 3 years later, I've solved it again.

It's being used in an open-source library, which is why I'm now able to do this (here), and it works with any number of indexes, using numpy matrix broadcasting to build the combinations across them.

First of all, that isn't really a valid DataFrame: the index isn't unique. Let's add another index to the object and turn it into a Series:

df = pd.DataFrame({
    'unique': [1, 2, 3, 4, 5, 6], 
    'index': ['a','b','c','a','b','c'], 
    'column': [1,2,3,4,1,2]
}).set_index(['unique','index'])
s = df['column']

Let's unstack that index so each category gets its own column:

>>> idxs = ['index'] # set as variable to be used later on
>>> unstacked = s.unstack(idxs)
index       a    b    c
unique
1         1.0  NaN  NaN
2         NaN  2.0  NaN
3         NaN  NaN  3.0
4         4.0  NaN  NaN
5         NaN  1.0  NaN
6         NaN  NaN  2.0
>>> vals = unstacked.values
array([[  1.,  nan,  nan],
       [ nan,   2.,  nan],
       [ nan,  nan,   3.],
       [  4.,  nan,  nan],
       [ nan,   1.,  nan],
       [ nan,  nan,   2.]])
>>> # per-column sums and counts of the non-NaN values
>>> sum = np.nansum(vals, axis=0)
>>> count = (~np.isnan(vals)).sum(axis=0)
>>> # broadcast to every pair: combined mean = (sum_i + sum_j) / (count_i + count_j)
>>> mean = (sum + sum[:, np.newaxis]) / (count + count[:, np.newaxis])
array([[ 2.5,  2. ,  2.5],
       [ 2. ,  1.5,  2. ],
       [ 2.5,  2. ,  2.5]])

The a & b entry, for example, is (1 + 4 + 2 + 1) / 4 = 2.0, the mean of the two groups' values pooled together. Now recreate the output DataFrame:

>>> new_df = pd.DataFrame(mean, unstacked.columns, unstacked.columns.copy())
index_    a    b    c
index
a       2.5  2.0  2.5
b       2.0  1.5  2.0
c       2.5  2.0  2.5
>>> idxs_ = [ x+'_' for x in idxs ]
>>> new_df.columns.names = idxs_
>>> new_df.stack(idxs_, dropna=False)
index  index_
a      a         2.5
       b         2.0
       c         2.5
b      a         2.0
       b         1.5
       c         2.0
c      a         2.5
       b         2.0
       c         2.5
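Wrapped up into a function, the steps above look roughly like this (just a sketch; the name pairwise_means and its arguments are mine, not part of any library API):

import numpy as np
import pandas as pd

def pairwise_means(s, idxs):
    # unstack the chosen index level(s) so each category becomes a column
    unstacked = s.unstack(idxs)
    vals = unstacked.values
    sums = np.nansum(vals, axis=0)
    counts = (~np.isnan(vals)).sum(axis=0)
    # combined mean of groups i and j = (sum_i + sum_j) / (count_i + count_j)
    mean = (sums + sums[:, np.newaxis]) / (counts + counts[:, np.newaxis])
    out = pd.DataFrame(mean, unstacked.columns, unstacked.columns.copy())
    out.columns.names = [x + '_' for x in idxs]
    return out.stack(list(out.columns.names), dropna=False)

df = pd.DataFrame({
    'unique': [1, 2, 3, 4, 5, 6],
    'index': ['a', 'b', 'c', 'a', 'b', 'c'],
    'column': [1, 2, 3, 4, 1, 2]
}).set_index(['unique', 'index'])

print(pairwise_means(df['column'], ['index']))  # e.g. the ('a', 'b') entry is 2.0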

Answer 1 (score: 0)

My current implementation is:

import pandas as pd
import itertools
import numpy as np

# get every pair of categories, plus the mean of their combined values
def all_pairs(df, ix):
    hash = {
        ix: [],
        'p': []
    }
    for subset in itertools.combinations(np.unique(np.array(df.index)), 2):
        hash[ix].append(subset)
        # slice out both categories and take the mean of the pooled values
        hash['p'].append(df.loc[list(subset), :].values.mean())

    return pd.DataFrame(hash).set_index(ix)

Get the combinations, add them to a hash, then build that back into a DataFrame. Not pretty, though :(
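Called on the DataFrame from the question it looks like this (note that it only produces the distinct pairs, not the a & a style self-pairs):

df = pd.DataFrame({'index': ['a','b','c','a','b','c'],
                   'column': [1,2,3,4,1,2]}).set_index(['index'])

print(all_pairs(df, 'index'))
#             p
# index
# (a, b)    2.0
# (a, c)    2.5
# (b, c)    2.0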

Answer 2 (score: 0)

Here's an implementation that uses a MultiIndex and an outer join to handle the cross join.

import pandas as pd
from pandas import DataFrame, Series
import numpy as np

df = pd.DataFrame({'index': ['a','b','c','a','b','c'], 'column': [1,2,3,4,1,2]}).set_index(['index'])

groupedDF = df.groupby(df.index).mean()
# Create a new MultiIndex using from_product, which pairs every element of one iterable with every element of the other
p = pd.MultiIndex.from_product([groupedDF.index, groupedDF.index])
# Add column for cross join
groupedDF[0] = 0
# Outer Join
groupedDF = pd.merge(groupedDF, groupedDF, how='outer', on=0).set_index(p)
# get mean for every row (which is the average for each pair)
# unstack to get matrix for deduplication
crossJoinMeans = groupedDF[['column_x', 'column_y']].mean(axis=1).unstack()
# Keep just one copy of each unordered pair by keeping the upper triangle
# (including the diagonal, which holds each group paired with itself)
b = np.triu(np.ones(crossJoinMeans.shape, dtype=bool))
# invert for proper use of DataFrame mask (mask hides entries where the condition is True)
b = np.invert(b)
finalDF = crossJoinMeans.mask(b).stack()

I'd guess this could be cleaned up and made a little more concise.
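For instance, one possible cleanup (a sketch only, swapping the merge-based cross join for numpy broadcasting on the group means; note that averaging the two group means matches the pooled mean here only because every group has the same size):

import numpy as np
import pandas as pd

df = pd.DataFrame({'index': ['a','b','c','a','b','c'],
                   'column': [1,2,3,4,1,2]}).set_index(['index'])

means = df.groupby(df.index)['column'].mean()
# broadcast the group means against each other to get the average of every pair
pairMeans = pd.DataFrame((means.values + means.values[:, np.newaxis]) / 2,
                         index=means.index, columns=means.index.copy())
# keep the upper triangle (each unordered pair once, including self-pairs)
keep = np.triu(np.ones(pairMeans.shape, dtype=bool))
finalDF = pairMeans.where(keep).stack()
print(finalDF)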