Creating a Python function to bin categories in pandas

Time: 2018-06-29 11:10:07

Tags: python python-2.7 pandas dataframe

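I am trying to write a reusable function in Python 2.7 (pandas) that bins categorical values, i.e. groups the less frequent categories of a column into a single 'other' category. Can someone help me turn the code below into a function? col1, col2, etc. are different categorical columns.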

##Reducing categories by binning categorical variables - column1
a = df.col1.value_counts()
#get top 5 values of index
vals = a[:5].index
df['col1_new'] = df.col1.where(df.col1.isin(vals), 'other')
df = df.drop(['col1'],axis=1)

##Reducing categories by binning categorical variables - column2
a = df.col2.value_counts()
#get top 6 values of index
vals = a[:6].index
df['col2_new'] = df.col2.where(df.col2.isin(vals), 'other')
df = df.drop(['col2'],axis=1)

1 Answer:

Answer 0 (score: 1)

You can use:

import pandas as pd

df = pd.DataFrame({'A':list('abcdefabcdefabffeg'),
                   'D':[1,3,5,7,1,0,1,3,5,7,1,0,1,3,5,7,1,0]})

print (df)
    A  D
0   a  1
1   b  3
2   c  5
3   d  7
4   e  1
5   f  0
6   a  1
7   b  3
8   c  5
9   d  7
10  e  1
11  f  0
12  a  1
13  b  3
14  f  5
15  f  7
16  e  1
17  g  0

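def replace_under_top(df, c, n):
    #count occurrences of each value, most frequent first
    a = df[c].value_counts()
    #get top n values of index
    vals = a[:n].index
    #assign column back (note: this modifies the passed-in df in place)
    df[c] = df[c].where(df[c].isin(vals), 'other')
    #rename processed column (rename returns a new DataFrame)
    df = df.rename(columns={c : c + '_new'})
    return df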

Test:

df1 = replace_under_top(df, 'A', 3)
print (df1)
    A_new  D
0   other  1
1       b  3
2   other  5
3   other  7
4       e  1
5       f  0
6   other  1
7       b  3
8   other  5
9   other  7
10      e  1
11      f  0
12  other  1
13      b  3
14      f  5
15      f  7
16      e  1
17  other  0
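
In the output above, 'a' is replaced by 'other' even though it occurs as often as 'b' and 'e'. Those three values are tied, and the order in which value_counts lists tied values may differ between pandas versions, so the top 3 may not be the same on every setup. The following is a minimal sketch of one way to make the choice reproducible; top_labels is a hypothetical helper that is not part of the original answer, and it assumes the labels are mutually comparable.

```python
import pandas as pd

s = pd.Series(list('abcdefabcdefabffeg'))
counts = s.value_counts()

# 'f' occurs 4 times; 'a', 'b' and 'e' are tied at 3 occurrences each,
# so which two of them make the top 3 depends on how ties are ordered.
print(counts.to_dict())

def top_labels(series, n):
    """Hypothetical helper: top-n labels, ties broken by label order."""
    c = series.value_counts()
    # Sort by descending count first, then by label, so ties are resolved
    # the same way on every run.
    ordered = sorted(c.index, key=lambda k: (-c[k], k))
    return ordered[:n]

print(top_labels(s, 3))  # ['f', 'a', 'b']
```

With this tie-break the kept values would be 'f', 'a' and 'b' rather than the 'b', 'e' and 'f' shown in the output above.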

#column A of df was already changed in place by the first call, so it shows 'other' here too
df2 = replace_under_top(df, 'D', 4)
print (df2)
        A  D_new
0   other      1
1       b      3
2   other      5
3   other      7
4       e      1
5       f  other
6   other      1
7       b      3
8   other      5
9   other      7
10      e      1
11      f  other
12  other      1
13      b      3
14      f      5
15      f      7
16      e      1
17  other  other
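
Because replace_under_top assigns back into the DataFrame it receives, the original df is changed as a side effect, which is why column A already contains 'other' in df2. If that is not wanted, a variant along the following lines may help. It is only a sketch: bin_rare_categories and its parameters top_n, other and suffix are hypothetical names introduced here, not part of the original answer. It works on a copy and handles several columns in one call, which is closer to the col1, col2 case in the question.

```python
import pandas as pd

def bin_rare_categories(df, top_n, other='other', suffix='_new'):
    """Return a copy of df in which, for each column named in top_n, only
    the top_n[column] most frequent values are kept; the rest become `other`."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col, n in top_n.items():
        # labels of the n most frequent values in this column
        keep = out[col].value_counts().index[:n]
        out[col] = out[col].where(out[col].isin(keep), other)
    # rename every processed column in one step
    return out.rename(columns={col: col + suffix for col in top_n})

df = pd.DataFrame({'A': list('abcdefabcdefabffeg'),
                   'D': [1,3,5,7,1,0,1,3,5,7,1,0,1,3,5,7,1,0]})

result = bin_rare_categories(df, {'A': 3, 'D': 4})
print(result)
```

Here top_n maps each column name to the number of categories to keep, so different columns can use different cut-offs, as in the question (5 for col1, 6 for col2). Tied counts are still resolved in whatever order value_counts returns them.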