Question

I have a pandas DataFrame:

import pandas as pd
df2 = pd.DataFrame({'c':[1,1,1,2,2,2,2,3],
                    'type':['m','n','o','m','m','n','n', 'p']})

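I want to find which values of c have more than one unique type and, for those, return the value of c, the number of unique types, and all unique types concatenated into a single string.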

So far I have used these two questions:

pandas add column to groupby dataframe
Python Pandas: concatenate rows with unique values

df2['Unique counts'] = df2.groupby('c')['type'].transform('nunique')

df2[df2['Unique counts'] > 1].groupby(['c', 'Unique counts']).\
                                  agg(lambda x: '-'.join(x))

Out[226]: 
                    type
c Unique counts         
1 3                m-n-o
2 2              m-m-n-n

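This works, but I cannot get only the unique values (for example, in the second row I want just one m and one n). My questions are:

Can I skip the step of creating the 'Unique counts' column and use something temporary instead?
How can I keep only the unique values in the second step?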

Answer 1

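One solution drops the groups with a single type first and then joins the values: create a helper Series s, and use set to get the unique strings (a set is unordered, so the order inside each joined string may vary):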

s= df2.groupby('c')['type'].transform('nunique').rename('Unique counts')
a = df2[s > 1].groupby(['c', s]).agg(lambda x: '-'.join(set(x)))
print (a)

                  type
c Unique counts       
1 3              o-m-n
2 2                m-n
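Because a set has no guaranteed iteration order, the first row above came out as o-m-n rather than m-n-o. If a stable result matters, one option is to sort the set before joining. This is only a minimal sketch, assuming alphabetical order is acceptable; it reuses the df2 and s names from the answer:

```python
import pandas as pd

df2 = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2, 3],
                    'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n', 'p']})

# Helper Series holding the number of distinct types per group of c.
s = df2.groupby('c')['type'].transform('nunique').rename('Unique counts')

# sorted(set(x)) removes duplicates and fixes the order alphabetically,
# so the joined string no longer depends on set iteration order.
a = df2[s > 1].groupby(['c', s]).agg(lambda x: '-'.join(sorted(set(x))))
print(a)
```

If the order of first appearance is preferred over alphabetical order, x.unique() can be joined instead of sorted(set(x)).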

Another idea is to remove duplicates first with DataFrame.duplicated:

df3 = df2[df2.duplicated(['c'],keep=False) & ~df2.duplicated(['c','type'])]
print (df3)

   c type
0  1    m
1  1    n
2  1    o
3  2    m
5  2    n

Then aggregate, using size for the counts and join for the strings:

a = df3.groupby('c')['type'].agg([('Unique Counts', 'size'), ('Type', '-'.join)])
print (a)
   Unique Counts   Type
c                      
1              3  m-n-o
2              2    m-n
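One caveat with the duplicated mask: duplicated(['c'], keep=False) only drops values of c that occur in a single row. A value of c with several rows of the same type would survive with a count of 1. The sketch below is one possible variant that filters on the count itself. The extra c == 4 rows are hypothetical and were added only to show that case:

```python
import pandas as pd

# Hypothetical extension of the question's data: c == 4 has two rows of one type.
df = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
                   'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n', 'p', 'q', 'q']})

# Keep one row per (c, type) pair, preserving the order of first appearance.
dedup = df.drop_duplicates(['c', 'type'])

# After deduplication, the group size equals the number of distinct types.
g = dedup.groupby('c')['type']
out = pd.DataFrame({'Unique Counts': g.size(), 'Type': g.agg('-'.join)})

# Filter on the count, so c == 3 and c == 4 (one distinct type each) are dropped.
out = out[out['Unique Counts'] > 1]
print(out)
```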

Or, if needed, aggregate all values first:

df4 = df2.groupby('c')['type'].agg([('Unique Counts', 'nunique'), 
                                  ('Type', lambda x: '-'.join(set(x)))])
print (df4)
   Unique Counts   Type
c                      
1              3  o-m-n
2              2    m-n
3              1      p

Finally, remove the rows with a single unique type using boolean indexing:

df5 = df4[df4['Unique Counts'] > 1]
print (df5)
   Unique Counts   Type
c                      
1              3  o-m-n
2              2    m-n
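The aggregate-then-filter steps can also be written as one chain using pandas named aggregation. This is a sketch rather than a replacement for the code above. The dict is unpacked with ** because the output column names contain spaces, and sorted(set(x)) is my own choice to make the string order deterministic:

```python
import pandas as pd

df2 = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2, 3],
                    'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n', 'p']})

result = (
    df2.groupby('c')['type']
       # Named aggregation: output column name -> aggregation function.
       .agg(**{'Unique Counts': 'nunique',
               'Type': lambda x: '-'.join(sorted(set(x)))})
       # Keep only the groups with more than one distinct type.
       .loc[lambda d: d['Unique Counts'] > 1]
)
print(result)
```

No temporary column is added to df2 here; the count exists only in the aggregated result.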

Answer 2

Use DataFrame.groupby.agg and pass tuples of (column name, function):

df2.groupby('c')['type'].agg([('Unique Counts', 'nunique'), ('Type', lambda x: '-'.join(x.unique()))])

[out]

   Unique Counts   Type
c                      
1              3  m-n-o
2              2    m-n
3              1      p

Answer 3

Use groupby.agg and then filter on the Unique counts column as needed:

import numpy as np

df2 = (df2.groupby('c', as_index=False)
          .agg({'type': ['nunique', lambda x: '-'.join(np.unique(x))]}))
df2.columns = ['c','Unique counts','type']

print(df2)
   c  Unique counts   type
0  1              3  m-n-o
1  2              2    m-n
2  3              1      p

Filter on Unique counts:

df2 = df2.loc[df2['Unique counts']>1,:]

print(df2)
   c  Unique counts   type
0  1              3  m-n-o
1  2              2    m-n
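As an alternative to aggregating first and filtering afterwards, the rows can be filtered per group before joining, which avoids any count column if only the joined strings are needed. A minimal sketch using GroupBy.filter; the names multi and joined are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2, 3],
                   'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n', 'p']})

# Keep only the rows whose group of c has more than one distinct type.
multi = df.groupby('c').filter(lambda g: g['type'].nunique() > 1)

# Join the distinct types of each remaining group, in order of first appearance.
joined = multi.groupby('c')['type'].agg(lambda x: '-'.join(x.unique()))
print(joined)
```

GroupBy.filter calls the lambda once per group in Python, so on large frames the transform('nunique') mask from the question is likely to be faster.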

Pandas: filter groups with more than one unique value and concatenate the unique values
