Bucketing multiple Pandas columns based on their top N values

Time: 2018-03-14 18:55:47

Tags: python pandas

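I want to loop through several DataFrame columns and find the top n most frequent values in each one. If a value in a column is among that column's top n values, keep it; otherwise replace it with "Other". I also want to store the result in new columns.

However, I'm not sure how to use .apply in this situation, because I seem to need to reference both the column and the row.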

import numpy as np
import pandas as pd

np.random.seed(0)
example_df = pd.DataFrame(np.random.randint(low=0, high=10, size=(15, 5)), columns=['a', 'b', 'c', 'd', 'e'])
cols_to_group = ['a', 'b', 'c']
top = 2

For the example above, here is my pseudocode, which I'm not sure how to turn into working code:

Pseudocode:

#loop through each column
for column in example_df[cols_to_group]:
    #loop through each value in column and check if it's in top values for the column. 
    for single_value in column:
        if single_value.isin(column.value_counts()[:top].values):
            #return value if it is in top values
            return single_value
        else:
            return "other"
    #create new column in your df that has bucketed values
    example_df[column.name + str("bucketed")+ str(top)]=column

Expected output:

A rough example for top = 2.

    a   b   c   d   e   a_bucketed b_bucketed
0   4   6   4   3   1     4          6
1   8   8   1   5   7     8          8 
2   8   6   0   0   2     8          6
3   4   1   0   7   4     4          Other
4   7   8   7   7   7     Other      8

1 Answer:

Answer 0: (score: 1)

Here is one way. Note that it makes no provision for handling ties.

df = example_df  # the frame built in the question; 2 below is the value of top
df['a_bucketed'] = np.where(df['a'].isin(df['a'].value_counts().index[:2]), df['a'], 'Other')
df['b_bucketed'] = np.where(df['b'].isin(df['b'].value_counts().index[:2]), df['b'], 'Other')

#     a  b  c  d  e a_bucketed b_bucketed
# 0   5  0  3  3  7      Other      Other
# 1   9  3  5  2  4          9          3
# 2   7  6  8  8  1      Other      Other
# 3   6  7  7  8  1      Other      Other
# 4   5  9  8  9  4      Other          9
# 5   3  0  3  5  0          3      Other
# 6   2  3  8  1  3      Other          3
# 7   3  3  7  0  1          3          3
# 8   9  9  0  4  7          9          9
# 9   3  2  7  2  0          3      Other
# 10  0  4  5  5  6      Other      Other
# 11  8  4  1  4  9      Other      Other
# 12  8  1  1  7  9      Other      Other
# 13  9  3  6  7  2          9          3
# 14  0  3  5  9  4      Other          3
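
The per-column lines above can be wrapped in a loop over cols_to_group, which is what the question asked for. The following is only a sketch under a few assumptions: the helper name bucket_top_n, the keep_ties flag, and the '<col>_bucketed<top>' naming are made up for illustration and are not part of the original answer. It uses Series.where on an object-typed copy of the column instead of np.where, so kept values stay as numbers rather than being converted to strings. With keep_ties=True, every value that is at least as frequent as the top-th most frequent value is kept, which is one possible way to treat ties.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
example_df = pd.DataFrame(
    np.random.randint(low=0, high=10, size=(15, 5)),
    columns=['a', 'b', 'c', 'd', 'e'],
)
cols_to_group = ['a', 'b', 'c']
top = 2


def bucket_top_n(df, cols, top, other='Other', keep_ties=False):
    """Hypothetical helper: return a copy of df with a '<col>_bucketed<top>' column per col."""
    out = df.copy()
    for col in cols:
        counts = out[col].value_counts()  # sorted by frequency, most frequent first
        if keep_ties:
            # Keep every value at least as frequent as the top-th most frequent one.
            cutoff = counts.iloc[min(top, len(counts)) - 1]
            keep = counts[counts >= cutoff].index
        else:
            # Keep exactly `top` values; which tied value wins is left to value_counts.
            keep = counts.index[:top]
        # Cast to object so numbers and the `other` label can share one column,
        # then keep values found in `keep` and substitute `other` everywhere else.
        out[f'{col}_bucketed{top}'] = out[col].astype(object).where(out[col].isin(keep), other)
    return out


bucketed = bucket_top_n(example_df, cols_to_group, top)
print(bucketed)
```

Calling bucket_top_n(example_df, ['b'], top, keep_ties=True) would keep 3 together with all three values tied for second place in column b (0, 4 and 9), instead of cutting the list off at two values.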