Renaming less frequent categories to "OTHER" in Python

Date: 2018-12-06 09:29:07

Tags: python pandas dataframe counter categorical-data

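In my DataFrame I have several categorical columns, each containing more than 100 distinct categories. I want to rank the categories by frequency, keep the 9 most frequent ones, and automatically rename all less frequent categories to OTHER.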

Example:

Here is my df:

print(df)

    Employee_number                 Jobrol
0                 1        Sales Executive
1                 2     Research Scientist
2                 3  Laboratory Technician
3                 4        Sales Executive
4                 5     Research Scientist
5                 6  Laboratory Technician
6                 7        Sales Executive
7                 8     Research Scientist
8                 9  Laboratory Technician
9                10        Sales Executive
10               11     Research Scientist
11               12  Laboratory Technician
12               13        Sales Executive
13               14     Research Scientist
14               15  Laboratory Technician
15               16        Sales Executive
16               17     Research Scientist
17               18     Research Scientist
18               19                Manager
19               20        Human Resources
20               21        Sales Executive


valCount = df['Jobrol'].value_counts()

valCount

Sales Executive          7
Research Scientist       7
Laboratory Technician    5
Manager                  1
Human Resources          1

In this example I want to keep the top 3 categories and rename the rest to "OTHER". How should I do that?

Thanks.

3 answers:

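Answer 0: (score: 3)

Convert your series to categorical dtype, extract the categories whose counts are not in the top 3, add a new category such as 'Other', then replace the extracted categories with it: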

df['Jobrol'] = df['Jobrol'].astype('category')

others = df['Jobrol'].value_counts().index[3:]
label = 'Other'

df['Jobrol'] = df['Jobrol'].cat.add_categories([label])
df['Jobrol'] = df['Jobrol'].replace(others, label)

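Note: it is tempting to combine categories by renaming them via df['Jobrol'].cat.rename_categories(dict.fromkeys(others, label)), but this does not work, because it would give several categories the same label, and category labels must be unique.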
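To make the note above concrete, here is a minimal sketch of what happens when several categories are renamed to the same label. The Series and its values are made-up illustration data, not the asker's frame, and the variable rename_failed exists only for this demo.

```python
import pandas as pd

# Made-up illustration data: "c" and "d" are the rare categories.
s = pd.Series(["a", "a", "a", "b", "b", "c", "d"], dtype="category")
others = ["c", "d"]
label = "Other"

try:
    # Both "c" and "d" would be renamed to "Other", producing
    # duplicate category labels, which pandas rejects.
    s.cat.rename_categories(dict.fromkeys(others, label))
    rename_failed = False
except ValueError as exc:
    rename_failed = True
    print("rename_categories failed:", exc)
```

This is why the answer adds a single new category first and then replaces the rare values with it.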


The solution above can be adapted to filter by count. For example, to include only the categories with a count of 1, define others like this:

counts = df['Jobrol'].value_counts()
others = counts[counts == 1].index

Answer 1: (score: 2)

Use value_counts together with numpy.where:

import numpy as np

need = df['Jobrol'].value_counts().index[:3]
df['Jobrol'] = np.where(df['Jobrol'].isin(need), df['Jobrol'], 'OTHER')

valCount = df['Jobrol'].value_counts()
print (valCount)
Research Scientist       7
Sales Executive          7
Laboratory Technician    5
OTHER                    2
Name: Jobrol, dtype: int64
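For reuse across several columns, the same idea can be wrapped in a small helper. This is only a sketch: keep_top_n is a hypothetical name, it uses Series.where instead of numpy.where so the result stays a pandas Series, and the data below is rebuilt from the counts shown in the question.

```python
import pandas as pd

def keep_top_n(series, n=3, other_label="OTHER"):
    """Return a copy of `series` in which every value outside the
    n most frequent ones is replaced by `other_label`."""
    top = series.value_counts().index[:n]
    # where() keeps values where the condition is True and
    # substitutes other_label everywhere else.
    return series.where(series.isin(top), other_label)

# Data rebuilt from the value counts in the question.
jobs = pd.Series(
    ["Sales Executive"] * 7
    + ["Research Scientist"] * 7
    + ["Laboratory Technician"] * 5
    + ["Manager", "Human Resources"],
    name="Jobrol",
)

lumped = keep_top_n(jobs, n=3)
summary = lumped.value_counts().to_dict()
print(summary)
```

Note that when several categories are tied at the cut-off, which of them is kept depends on the order value_counts returns them in.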

Another solution, which builds only the summary counts and leaves df unchanged:

N = 3
s = df['Jobrol'].value_counts()
# Series.append was removed in pandas 2.0; pd.concat works in old and new versions
valCount = pd.concat([s.iloc[:N], pd.Series(s.iloc[N:].sum(), index=['OTHER'])])
print (valCount)
Research Scientist       7
Sales Executive          7
Laboratory Technician    5
OTHER                    2
dtype: int64

Answer 2: (score: 0)

One-line solution:

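# Categories that occur `limit` times or fewer are renamed to 'other'; pick a limit that suits your data
limit = 500
df['Jobrol'] = df['Jobrol'].map({x[0]: x[0] if x[1] > limit else 'other' for x in dict(df['Jobrol'].value_counts()).items()})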
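A variant of the same threshold idea, sketched here as an assumption rather than as the answerer's method: map each row to the count of its category, then keep only the rows whose count exceeds the limit. The limit of 2 is an illustrative value chosen to fit the small sample from the question, and the data is rebuilt from the counts shown there.

```python
import pandas as pd

# Data rebuilt from the value counts in the question.
jobs = pd.Series(
    ["Sales Executive"] * 7
    + ["Research Scientist"] * 7
    + ["Laboratory Technician"] * 5
    + ["Manager", "Human Resources"],
    name="Jobrol",
)

limit = 2  # illustrative threshold for this small sample

# For every row, look up how often its category occurs.
row_counts = jobs.map(jobs.value_counts())

# Keep categories seen more than `limit` times, rename the rest.
lumped = jobs.where(row_counts > limit, "other")
summary = lumped.value_counts().to_dict()
print(summary)
```

This avoids building an intermediate dict and makes the threshold comparison explicit.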