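I want to loop over several DataFrame columns and find the top n most frequent values in each one. If a value in a column is among that column's top n values, keep it; otherwise replace it with "Other". I also want to store the result in new columns.
However, I'm not sure how to use .apply in this situation, because it seems I need to reference both the column and the row.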
import numpy as np
import pandas as pd

np.random.seed(0)
example_df = pd.DataFrame(np.random.randint(low=0, high=10, size=(15, 5)), columns=['a', 'b', 'c', 'd', 'e'])
cols_to_group = ['a', 'b', 'c']
top = 2
For the example above, here is my pseudocode, which I'm not sure how to turn into working code:
Pseudocode:
# loop through each column
for column in example_df[cols_to_group]:
    # loop through each value in column and check if it's in top values for the column.
    for single_value in column:
        if single_value.isin(column.value_counts()[:top].values):
            # return value if it is in top values
            return single_value
        else:
            return "other"
    # create new column in your df that has bucketed values
    example_df[column.name + str("bucketed") + str(top)] = column
Expected output:
A rough example for top = 2.
   a  b  c  d  e a_bucketed b_bucketed
0  4  6  4  3  1          4          6
1  8  8  1  5  7          8          8
2  8  6  0  0  2          8          6
3  4  1  0  7  4          4      Other
4  7  8  7  7  7      Other          8
Answer 0 (score: 1)
Here is one way to do it. Note that it makes no provision for handling ties.
# df is a copy of the example_df built in the question
df = example_df.copy()
df['a_bucketed'] = np.where(df['a'].isin(df['a'].value_counts().index[:2]), df['a'], 'Other')
df['b_bucketed'] = np.where(df['b'].isin(df['b'].value_counts().index[:2]), df['b'], 'Other')
# a b c d e a_bucketed b_bucketed
# 0 5 0 3 3 7 Other Other
# 1 9 3 5 2 4 9 3
# 2 7 6 8 8 1 Other Other
# 3 6 7 7 8 1 Other Other
# 4 5 9 8 9 4 Other 9
# 5 3 0 3 5 0 3 Other
# 6 2 3 8 1 3 Other 3
# 7 3 3 7 0 1 3 3
# 8 9 9 0 4 7 9 9
# 9 3 2 7 2 0 3 Other
# 10 0 4 5 5 6 Other Other
# 11 8 4 1 4 9 Other Other
# 12 8 1 1 7 9 Other Other
# 13 9 3 6 7 2 9 3
# 14 0 3 5 9 4 Other 3
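The same idea can be looped over cols_to_group so that each column gets its own bucketed column. The following is only a sketch: the helper name bucket_top_values, the '_bucketed' suffix, and the keep_ties flag are assumptions made for illustration, not part of the question or the answer above. It uses Series.where instead of np.where so that the kept values stay as the original integers rather than being converted to strings, and it optionally uses nlargest(keep='all') as one possible way to treat ties.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
example_df = pd.DataFrame(
    np.random.randint(low=0, high=10, size=(15, 5)),
    columns=['a', 'b', 'c', 'd', 'e'],
)
cols_to_group = ['a', 'b', 'c']
top = 2


def bucket_top_values(df, cols, top, other='Other', keep_ties=False):
    """Hypothetical helper: return a copy of df with a '<col>_bucketed' column per entry in cols."""
    out = df.copy()
    for col in cols:
        # value_counts() sorts by frequency, most common value first
        counts = out[col].value_counts()
        if keep_ties:
            # keep='all' also retains every value tied with the last kept count
            top_values = counts.nlargest(top, keep='all').index
        else:
            # ties at the cut-off are broken by whatever order value_counts() returns
            top_values = counts.index[:top]
        # keep values that are in top_values, substitute `other` everywhere else
        out[col + '_bucketed'] = out[col].where(out[col].isin(top_values), other)
    return out


bucketed_df = bucket_top_values(example_df, cols_to_group, top)
print(bucketed_df.head())
```

With keep_ties=False the result for a column with a tie at the cut-off (such as b here) depends on how value_counts() orders equal counts, so it may differ between pandas versions. Passing keep_ties=True keeps all tied values, at the cost of sometimes keeping more than top distinct values.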