How to get the top categories of one column and store the count of the rest in another column

Time: 2019-01-10 09:57:10

Tags: python python-3.x pandas dataframe

Imagine I have the following dataframe:

import pandas as pd

df = pd.DataFrame({'col1': ['a','b','c','d','e','f','g','h','i','j','k','l'], 'col2': [1,1,1,2,2,3,3,3,4,5,5,6]})

    col1    col2
0      a       1
1      b       1
2      c       1
3      d       2
4      e       2
5      f       3
6      g       3
7      h       3
8      i       4
9      j       5
10     k       5
11     l       6

If I use this code:

df[df.col2.isin(df.groupby('col2').size().head(3).index)]

I can retrieve the 3 most frequent categories in col2.
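One caveat about the snippet above, shown as a minimal sketch: `groupby('col2').size()` is indexed by the sorted group keys, so `head(3)` selects the three smallest keys rather than the three largest counts. Ranking by count could look like the following; the variable names `first_by_key` and `top_by_count` are only illustrative, and `keep='all'` is an assumption about how ties should be handled.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': list('abcdefghijkl'),
    'col2': [1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6],
})

# Indexed by sorted group key: head(3) returns keys 1, 2 and 3
# regardless of how often each one occurs.
first_by_key = df.groupby('col2').size().head(3)

# Ranked by count instead. keep='all' retains every category tied
# for the last place, so more than three rows can come back.
top_by_count = df['col2'].value_counts().nlargest(3, keep='all')

print(first_by_key)
print(top_by_count)
```

In this sample data, categories 2 and 5 both occur twice, so a strict "top 3 by frequency" is ambiguous and the tie has to be resolved one way or another.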

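Edit

What I want to do is filter the dataframe so that only the most frequent categories are kept in col2. Then I want to create dummy columns for each category that indicate, for each category and each letter in col1, how many records belong to that category.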

This would be the resulting dataframe:

    col1    col2_1  col2_2  col2_3  rest_count
0      a         1       0       0           0
1      b         1       0       0           0
2      c         1       0       0           0
3      d         0       1       0           0
4      e         0       1       0           0
5      f         0       0       1           0
6      g         0       0       1           0
7      h         0       0       1           0
8      i         0       0       0           1
9      j         0       0       0           1
10     k         0       0       0           1
11     l         0       0       0           1

How can I store the count of the remaining categories in the newly created column rest_count?

Thanks in advance.

2 answers:

Answer 0: (score: 1)

def check_top(row, df_top):
    """create extra mask column called top3
    it will be used to filter out col2 values"""

    if row.col2 in df_top:
        row['top3'] = True
    else:
        row['top3'] = False
    return row

def update_cols(row):
    """update col2 and col3 values depending on top3 value"""

    if row['top3']:
        row['col3'] = None
    else:
        row['col2'] = None
    return row

# get top3 values
df_top = df.groupby('col2').size().head(3).index
df = df.apply(lambda row: check_top(row, df_top), axis=1) 

# create col3 column
df['col3'] = df['col2']

df = df.apply(lambda row: update_cols(row), axis=1)

# select the columns that you need
df = df[['col1', 'col2', 'col3']]
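The row-wise `apply` calls above can probably be replaced by column operations. The following is a minimal sketch of that idea, not the answerer's code: it assumes the same sample dataframe, introduces the illustrative names `is_top` and `out`, and leaves `NaN` (rather than `None`) in the blanked cells, which also turns the numeric columns into floats.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': list('abcdefghijkl'),
    'col2': [1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6],
})

# Same selection as in the answer: the first three group keys.
top = df.groupby('col2').size().head(3).index

# Boolean mask playing the role of the 'top3' helper column.
is_top = df['col2'].isin(top)

out = pd.DataFrame({
    'col1': df['col1'],
    'col2': df['col2'].where(is_top),   # NaN where the category is not selected
    'col3': df['col2'].where(~is_top),  # NaN where the category is selected
})
print(out)
```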

Answer 1: (score: 1)

Use:

# get the top values
v = df.groupby('col2').size().head(3).index
# build a new DataFrame by comparing col2 with each value
df1 = pd.concat([(df.col2 == x).astype(int) for x in v], axis=1)
# number the columns to build their names
df1.columns = ['{}_{}'.format(x, i) for i, x in enumerate(df1.columns, 1)]
# join with the original
df = df.join(df1)
# add a column flagging the remaining values
df['rest_count'] = (~df.col2.isin(v)).astype(int)
print(df)
   col1  col2  col2_1  col2_2  col2_3  rest_count
0     a     1       1       0       0           0
1     b     1       1       0       0           0
2     c     1       1       0       0           0
3     d     2       0       1       0           0
4     e     2       0       1       0           0
5     f     3       0       0       1           0
6     g     3       0       0       1           0
7     h     3       0       0       1           0
8     i     4       0       0       0           1
9     j     5       0       0       0           1
10    k     5       0       0       0           1
11    l     6       0       0       0           1
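A possible variation on the same result, sketched here under the assumption that the sample dataframe is unchanged: relabel every non-selected category as `'rest'` and let `pd.get_dummies` build all indicator columns in one step. The names `labels`, `dummies` and `result` are illustrative and do not come from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': list('abcdefghijkl'),
    'col2': [1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6],
})

# Same selection as in the answer: the first three group keys.
top = df.groupby('col2').size().head(3).index

# Work on strings so the selected values and the 'rest' label share one dtype.
labels = df['col2'].astype(str).where(df['col2'].isin(top), 'rest')

# One indicator column per label: col2_1, col2_2, col2_3, col2_rest.
dummies = pd.get_dummies(labels, prefix='col2', dtype=int)

result = df[['col1']].join(dummies).rename(columns={'col2_rest': 'rest_count'})
print(result)
```

Here each column suffix is the category value itself, whereas the answer's `enumerate` numbering yields the position in `v`; the two only coincide because the selected categories are 1, 2 and 3.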