Question

I need to do a fuzzy groupby where a single record can be in one or more groups.

I have a DataFrame like this:

test = pd.DataFrame({'score1' : pd.Series(['a', 'b', 'c', 'd', 'e']), 'score2' : pd.Series(['b', 'a', 'k', 'n', 'c'])})

Output:

  score1  score2
0   a       b
1   b       a
2   c       k
3   d       n
4   e       c

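I would like to have groups like this:

The group keys should be the union of the unique values in score1 and score2. Record 0 should be in groups a and b because it contains both score values. Similarly, record 1 should be in groups b and a; record 2 should be in groups c and k, and so on.

I tried doing a groupby on the two columns: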

In [192]: score_groups = test.groupby(['score1', 'score2'])

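However, I get the group keys as tuples, such as ('a', 'b'), ('b', 'a'), ('c', 'k'), instead of unique group keys where a record can be in multiple groups. The output looks like this: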

In [192]: score_groups.groups

Out[192]: {('a', 'b'): [0],
           ('b', 'a'): [1],
           ('c', 'k'): [2],
           ('d', 'n'): [3],
           ('e', 'c'): [4]}

Furthermore, I need the indices preserved, because I will use them for another operation later. Please help!

Answer 1

Combine the two columns into a single one, using e.g. pd.concat():

s = pd.concat([test['score1'], test['score2'].rename('score1')]).reset_index()
s.columns = ['val', 'grp']

   val grp
0    0   a
1    1   b
2    2   c
3    3   d
4    4   e
5    0   b
6    1   a
7    2   k
8    3   n
9    4   c

Then .groupby() on 'grp' and collect 'val' in a list:

s = s.groupby('grp').apply(lambda x: x.val.tolist())

a    [0, 1]
b    [1, 0]
c    [2, 4]
d       [3]
e       [4]
k       [2]
n       [3]

Or, if you prefer a dict:

s.to_dict()

{'e': [4], 'd': [3], 'n': [3], 'k': [2], 'a': [0, 1], 'c': [2, 4], 'b': [1, 0]}

Or, to the same effect in a single step, skipping the column renaming:

test.unstack().reset_index(-1).groupby(0).apply(lambda x: x.level_1.tolist())

a    [0, 1]
b    [1, 0]
c    [2, 4]
d       [3]
e       [4]
k       [2]
n       [3]
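Since the question asks for the indices to be kept for later operations, here is a minimal sketch of how the collected labels might be used to pull rows back out of the original frame. It assumes the question's `test` DataFrame; the names `members` and `rows_for_c` are made up for illustration, and `agg(list)` is swapped in for the `apply` call as one possible alternative, not as the answer's own method.

```python
import pandas as pd

test = pd.DataFrame({
    'score1': ['a', 'b', 'c', 'd', 'e'],
    'score2': ['b', 'a', 'k', 'n', 'c'],
})

# Stack both columns into one, keeping the original row labels as column 'val'.
s = (
    pd.concat([test['score1'], test['score2']])
      .rename('grp')
      .rename_axis('val')
      .reset_index()
)

# For every score value, collect the original row labels in a list.
members = s.groupby('grp')['val'].agg(list)

# The stored labels select the matching rows of the original frame.
rows_for_c = test.loc[members['c']]
print(rows_for_c)
```

Here `members['c']` holds the labels of the two records that contain `c` in either column, so `test.loc[...]` returns exactly those rows.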

Answer 2

With Stefan's help, I solved the problem.

In [283]: frame1 = test[['score1']]
          # rename() returns a new frame, which avoids modifying a slice of test in place
          frame2 = test[['score2']].rename(columns={'score2': 'score1'})

          test = pd.concat([frame1, frame2])

          test

Out[283]:   
   score1
0   a
1   b
2   c
3   d
4   e
0   b
1   a
2   k
3   n
4   c

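Note the duplicate indices. The indices have been preserved, which is what I wanted. Now, let's get down to business with the groupby operation.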

In [283]: groups = test.groupby('score1')

          groups.get_group('a') # Get group with key a

Out[283]: 
    score1
0   a
1   a

In [283]: groups.get_group('b') # Get group with key b

Out[283]: 
    score1
1   b
0   b

In [283]: groups.get_group('c') # Get group with key c

Out[283]: 
    score1
2   c
4   c

In [283]: groups.get_group('k') # Get group with key k

Out[283]: 
    score1
2   k

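I am puzzled about how pandas retrieves the rows with the correct indices even though they are duplicated. As I understand it, the groupby operation uses an inverted index data structure to store references (indices) to rows. Any insight would be greatly appreciated. Anyone who answers this will have their answer accepted :)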
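One way to look at the duplicate-index question, offered as a sketch rather than a statement about pandas internals: a GroupBy object exposes both `.indices` (integer positions of each group's rows) and `.groups` (the index labels at those positions). Selecting by position is unaffected by repeated labels. The frame `stacked` below is a hand-built copy of the concatenated frame shown above, and the variable names are illustrative.

```python
import pandas as pd

# Hand-built copy of the concatenated frame, with its duplicated index.
stacked = pd.DataFrame(
    {'score1': ['a', 'b', 'c', 'd', 'e', 'b', 'a', 'k', 'n', 'c']},
    index=[0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
)
groups = stacked.groupby('score1')

positions = groups.indices['c']   # integer positions of the rows in group 'c'
labels = groups.groups['c']       # index labels found at those positions

# Selecting by position gives the same rows as get_group('c'),
# even though labels 2 and 4 each occur twice in the index.
by_position = stacked.iloc[positions]
print(positions, list(labels))
print(by_position)
```

The labels alone would be ambiguous here (label 2 appears for both `c` and `k`), while the positions identify each row uniquely.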

Answer 3

Reorganize your data for easier manipulation (having multiple value columns for the same data will always cause you headaches).

import pandas as pd

test = pd.DataFrame({'score1' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']), 'score2' : pd.Series([2, 1, 8, 9, 3], index=['a', 'b', 'c', 'd', 'e'])})

test['name'] = test.index
result = pd.melt(test, id_vars=['name'], value_vars=['score1', 'score2'])

  name variable  value
0    a   score1      1
1    b   score1      2
2    c   score1      3
3    d   score1      4
4    e   score1      5
5    a   score2      2
6    b   score2      1
7    c   score2      8
8    d   score2      9
9    e   score2      3

Now you have only one column for your values, and it is easy to group by score or to select by the name column:

   hey = result.groupby('value')
   hey.groups
   #below are the indices that you care about
   {1: [0, 6], 2: [1, 5], 3: [2, 9], 4: [3], 5: [4], 8: [7], 9: [8]}
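The same melt idea can be applied to the frame from the question while keeping its original row labels. This is a minimal sketch, assuming pandas 1.1 or later (where `melt` accepts `ignore_index`); the names `long`, `grouped` and `members` are illustrative.

```python
import pandas as pd

test = pd.DataFrame({
    'score1': ['a', 'b', 'c', 'd', 'e'],
    'score2': ['b', 'a', 'k', 'n', 'c'],
})

# ignore_index=False keeps the original row labels, repeated once per melted column.
long = test.melt(
    value_vars=['score1', 'score2'],
    value_name='score',
    ignore_index=False,
)

grouped = long.groupby('score')

# .groups maps each score to the index labels of its rows; convert to plain lists.
members = {key: idx.tolist() for key, idx in grouped.groups.items()}
print(members)
```

Because the labels survive the reshape, no separate `name` column is needed to trace a group member back to its original record.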
