Suppose I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'col1': ['a','b','c','d','e','f','g','h','i','j','k','l'], 'col2': [1,1,1,2,2,3,3,3,4,5,5,6]})
   col1  col2
0     a     1
1     b     1
2     c     1
3     d     2
4     e     2
5     f     3
6     g     3
7     h     3
8     i     4
9     j     5
10    k     5
11    l     6
If I use this code:
df[df.col2.isin(df.groupby('col2').size().head(3).index)]
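I can retrieve what I take to be the 3 most frequent categories in col2 (strictly, head(3) returns the first three categories in sorted order, which here are 1, 2 and 3).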
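To rank the categories by frequency explicitly, one option is `Series.nlargest` on the group sizes. This is a minimal sketch, assuming that ties may be broken in favour of whichever category comes first; the names `sizes`, `top3` and `filtered` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': list('abcdefghijkl'),
    'col2': [1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6],
})

# Row count per category; the result is ordered by category value, not by count
sizes = df.groupby('col2').size()

# Labels of the three largest groups (on ties, the earlier category is kept)
top3 = sizes.nlargest(3).index

# Keep only the rows whose col2 belongs to one of those categories
filtered = df[df['col2'].isin(top3)]
```

On this sample, categories 2 and 5 are tied with two rows each, so which of them counts as the third most frequent depends on the tie-breaking rule.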
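Edit:
What I want to do is filter the DataFrame so that only the most frequent categories remain in col2. Then I want to create a dummy column for each of those categories, indicating, for each letter in col1, how many records fall into that category.
This would be the resulting DataFrame: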
   col1  col2_1  col2_2  col2_3  rest_count
0     a       1       0       0           0
1     b       1       0       0           0
2     c       1       0       0           0
3     d       0       1       0           0
4     e       0       1       0           0
5     f       0       0       1           0
6     g       0       0       1           0
7     h       0       0       1           0
8     i       0       0       0           1
9     j       0       0       0           1
10    k       0       0       0           1
11    l       0       0       0           1
How can I store the count of the remaining categories in the newly created column rest_count?
Thanks in advance.
Answer 0 (score: 1)
def check_top(row, df_top):
    """Add a boolean mask column called top3.

    It is used to filter the col2 values."""
    row['top3'] = row.col2 in df_top
    return row

def update_cols(row):
    """Update col2 and col3 depending on the top3 value."""
    if row['top3']:
        row['col3'] = None
    else:
        row['col2'] = None
    return row

# get the top-3 values (the first three groups in key order)
df_top = df.groupby('col2').size().head(3).index
df = df.apply(lambda row: check_top(row, df_top), axis=1)
# create the col3 column as a copy of col2
df['col3'] = df['col2']
df = df.apply(update_cols, axis=1)
# select the columns that you need
df = df[['col1', 'col2', 'col3']]
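The row-wise `apply` calls above can probably be replaced by vectorized operations. The following is a sketch under that assumption, not the answerer's code: `is_top` and `out` are illustrative names, and missing entries show up as NaN rather than None.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': list('abcdefghijkl'),
    'col2': [1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6],
})

# Same selection as in the answer: the first three categories by group key
top = df.groupby('col2').size().head(3).index
is_top = df['col2'].isin(top)

out = pd.DataFrame({
    'col1': df['col1'],
    # col2 keeps only the selected categories
    'col2': df['col2'].where(is_top),
    # col3 keeps only the remaining categories
    'col3': df['col2'].where(~is_top),
})
```

Because the mask is computed once for the whole column, this avoids calling a Python function per row, which tends to matter on larger frames.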
Answer 1 (score: 1)
Use:
# get the top values (the first three groups in key order)
v = df.groupby('col2').size().head(3).index
# build one 0/1 indicator column per top value
df1 = pd.concat([(df.col2 == x).astype(int) for x in v], axis=1)
# number the column names: col2_1, col2_2, col2_3
df1.columns = ['{}_{}'.format(x, i) for i, x in enumerate(df1.columns, 1)]
# join with the original DataFrame
df = df.join(df1)
# flag the rows whose col2 is not among the top values
df['rest_count'] = (~df.col2.isin(v)).astype(int)
print(df)
   col1  col2  col2_1  col2_2  col2_3  rest_count
0     a     1       1       0       0           0
1     b     1       1       0       0           0
2     c     1       1       0       0           0
3     d     2       0       1       0           0
4     e     2       0       1       0           0
5     f     3       0       0       1           0
6     g     3       0       0       1           0
7     h     3       0       0       1           0
8     i     4       0       0       0           1
9     j     5       0       0       0           1
10    k     5       0       0       0           1
11    l     6       0       0       0           1
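An alternative sketch, not taken from the answer above, relabels every category outside the selected ones as a single `rest` bucket and lets `pd.get_dummies` build the indicator columns. Here the suffix in `col2_1`, `col2_2`, `col2_3` is the category value itself rather than a positional counter (the two coincide for this sample), and `labels`, `dummies` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': list('abcdefghijkl'),
    'col2': [1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6],
})

# Same selection as in the answer: the first three categories by group key
top = df.groupby('col2').size().head(3).index

# Keep the selected categories as strings and label everything else 'rest'
labels = df['col2'].astype(str).where(df['col2'].isin(top), 'rest')

# One 0/1 column per label: col2_1, col2_2, col2_3, col2_rest
dummies = pd.get_dummies(labels, prefix='col2', dtype=int)

# Attach to col1 and give the 'rest' column the name used in the question
result = df[['col1']].join(dummies).rename(columns={'col2_rest': 'rest_count'})
```

This drops the original col2 column, which matches the layout requested in the question; joining `dummies` onto `df` instead would keep it.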