将同一列中的多个项目分组

时间:2019-09-16 23:35:01

标签: python pandas

我试图将多个列中的多个项目递归分组。想知道是否有人可以帮助我。

以下是示例。

import pandas as pd
import itertools

# i woiuld have more than two groups
data = {'group1': ['a'] * 3 + ['b'] * 3,
        'group2': list(range(1,4)) + list(range(1,4)),
        'num': [1, 2, 3, 10, 15, 20]} 

df = pd.DataFrame(data)
print(df)

desired_df = {'group1': ['a'] * 9,
              'group2': ['b'] * 9,
              'num_group1': list(range(1,4))*3,
              'num_group2': list(itertools.chain.from_iterable(itertools.repeat(x, 3) for x in list(range(1,4)))),
              'desired_column': [11, 12, 13, 16, 17, 18, 21, 22, 23]
              }
# desired column is the sum of 'num' from 'group1' and 'group2' in df
desired = pd.DataFrame(desired_df)
print(desired)   



# i have tried this...which obviously doesnt work
data1 = df.merge(df.drop(columns=['num']), left_on=['group1'], right_on=['group1'])
data1.groupby(['group2_x', 'group2_y'])['num'].sum()

我确定我缺少一些简单的东西...有什么建议吗?

3 个答案:

答案 0 :(得分:1)

您可以group_names = ['group1', 'group2'] groups = (x[1][group_names].values for x in df.groupby('group1')) combined = [tuple(tuple(y) for y in x) for x in it.product(*groups)] df.set_index(group_names, inplace=True) result = pd.Series([sum(df.loc[x, 'num'] for x in item) for item in combined], index=pd.MultiIndex.from_tuples(combined)) print(result) ,然后计算所有组之间相关列的组合,最后使用这些项目来索引原始数据帧:

(a, 1)  (b, 1)    11
        (b, 2)    16
        (b, 3)    21
(a, 2)  (b, 1)    12
        (b, 2)    17
        (b, 3)    22
(a, 3)  (b, 1)    13
        (b, 2)    18
        (b, 3)    23

这将产生以下结果(索引对应于组合):

  group1  group2  num
0      a       1    1
1      a       2    2
2      a       3    3
3      b       1   10
4      b       2   15
5      b       3   20
6      c       1  100
7      c       2  200
8      c       3  300

(a, 1)  (b, 1)  (c, 1)    111
                (c, 2)    211
                (c, 3)    311
        (b, 2)  (c, 1)    116
                (c, 2)    216
                (c, 3)    316
        (b, 3)  (c, 1)    121
                (c, 2)    221
                (c, 3)    321
(a, 2)  (b, 1)  (c, 1)    112
                (c, 2)    212
                (c, 3)    312
        (b, 2)  (c, 1)    117
                (c, 2)    217
                (c, 3)    317
        (b, 3)  (c, 1)    122
                (c, 2)    222
                (c, 3)    322
(a, 3)  (b, 1)  (c, 1)    113
                (c, 2)    213
                (c, 3)    313
        (b, 2)  (c, 1)    118
                (c, 2)    218
                (c, 3)    318
        (b, 3)  (c, 1)    123
                (c, 2)    223
                (c, 3)    323

这也适用于两个以上的组,例如:

  group1  group2 group3  num
0      a       0      q    1
1      a       1      r    2
2      a       0      s    3
3      a       1      t    4
4      b       0      q   10
5      b       1      r   15
6      b       0      s   20
7      b       1      t   25

(a, 0, q)  (b, 0, q)    11
           (b, 1, r)    16
           (b, 0, s)    21
           (b, 1, t)    26
(a, 1, r)  (b, 0, q)    12
           (b, 1, r)    17
           (b, 0, s)    22
           (b, 1, t)    27
(a, 0, s)  (b, 0, q)    13
           (b, 1, r)    18
           (b, 0, s)    23
           (b, 1, t)    28
(a, 1, t)  (b, 0, q)    14
           (b, 1, r)    19
           (b, 0, s)    24
           (b, 1, t)    29

它也适用于两列以上,例如:

private void Filtriraj_clanove_Click(object sender, EventArgs e)
    {
      (baza_filter_clanova.DataSource as DataTable).DefaultView.RowFilter = string.Format("ime LIKE '{0}%' AND prezime LIKE '{1}%'", pretraga_ime.Text, pretraga_prezime.Text);
    }

答案 1 :(得分:1)

您可以使用

x, y = [y.assign(key=1) for x , y in df.groupby('group1')]
s=x.merge(y,on='key')
s['X']=s.num_x+s.num_y
s
  group1_x  group2_x  num_x  key group1_y  group2_y  num_y   X
0        a         1      1    1        b         1     10  11
1        a         1      1    1        b         2     15  16
2        a         1      1    1        b         3     20  21
3        a         2      2    1        b         1     10  12
4        a         2      2    1        b         2     15  17
5        a         2      2    1        b         3     20  22
6        a         3      3    1        b         1     10  13
7        a         3      3    1        b         2     15  18
8        a         3      3    1        b         3     20  23

答案 2 :(得分:1)

您可以尝试使用itertools中的组合:

from itertools import combinations
df2=pd.DataFrame([list(key)[0] +list(key)[1] for key in combinations(df.values.tolist(),2)])
df3=df2[df2[0].ne(df2[3])].reset_index(drop=True)
df3[5]=df3[5]+df3[2]
print(df3) 
df4=df3[[0,1,3,4,5]].reindex(columns=[0,3,1,4,5]).rename(columns={3:'group2',0:'group1',1:'num_group1',4:'num_group2',5:'desired_column'})
df_desired=df4.sort_values('desired_column').reset_index(drop=True)
print(df_desired)

输出:

   0  1  2  3  4   5
0  a  1  1  b  1  11
1  a  1  1  b  2  16
2  a  1  1  b  3  21
3  a  2  2  b  1  12
4  a  2  2  b  2  17
5  a  2  2  b  3  22
6  a  3  3  b  1  13
7  a  3  3  b  2  18
8  a  3  3  b  3  23
  group1 group2  num_group1  num_group2  desired_column
0      a      b           1           1              11
1      a      b           2           1              12
2      a      b           3           1              13
3      a      b           1           2              16
4      a      b           2           2              17
5      a      b           3           2              18
6      a      b           1           3              21
7      a      b           2           3              22
8      a      b           3           3              23