Question

There is a great solution for this in R.

My df.column looks like this:

Windows
Windows
Mac
Mac
Mac
Linux
Windows
...

I would like to replace the low-frequency categories in this df.column vector with 'Other'. For example, I need df.column to look like:

Windows
Windows
Mac
Mac
Mac
Linux -> Other
Windows
...

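I want to rename these rare categories to reduce the number of factors in my regression, which is why I need the original vector. In Python, running the following command to get the frequency table gives: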

pd.value_counts(df.column)


Windows          26083
iOS              19711
Android          13077
Macintosh         5799
Chrome OS          347
Linux              285
Windows Phone      167
(not set)           22
BlackBerry          11

I wonder whether there is a way to rename 'Chrome OS', 'Linux' (low-frequency data) to another category (for example, 'Other'), and to do so efficiently.

Answer 1

Build a mask from the percentage of rows each category occupies, i.e.:

import numpy as np
import pandas as pd

series = df['column'].value_counts()
mask = (series / series.sum() * 100).lt(1)  # True for categories under 1% of rows
# To replace the values in df['column'], use np.where, i.e.
df['column'] = np.where(df['column'].isin(series[mask].index), 'Other', df['column'])

To change the index and sum the rare counts into one entry:

new = series[~mask].copy()  # copy so the assignment below does not touch series
new['Other'] = series[mask].sum()

Windows      26083
iOS          19711
Android      13077
Macintosh     5799
Other          832
Name: 1, dtype: int64

If you want to replace the index labels instead:

series.index = np.where(series.index.isin(series[mask].index),'Other',series.index)

Windows      26083
iOS          19711
Android      13077
Macintosh     5799
Other          347
Other          285
Other          167
Other           22
Other           11
Name: 1, dtype: int64
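
The relabelled index above still has one row per original category, so 'Other' appears five times. As a minimal sketch (not part of the original answer), the duplicate labels could be collapsed by grouping on the index; the counts are copied from the question's frequency table and the name `collapsed` is only illustrative.

```python
import numpy as np
import pandas as pd

# Counts copied from the question's frequency table
series = pd.Series(
    [26083, 19711, 13077, 5799, 347, 285, 167, 22, 11],
    index=["Windows", "iOS", "Android", "Macintosh", "Chrome OS",
           "Linux", "Windows Phone", "(not set)", "BlackBerry"],
)

# Same mask as in the answer: categories under 1% of all rows
mask = (series / series.sum() * 100).lt(1)
series.index = np.where(series.index.isin(series[mask].index), "Other", series.index)

# Sum the rows that now share the label "Other", then sort by count
collapsed = series.groupby(level=0).sum().sort_values(ascending=False)
print(collapsed)
```

This should give the same totals as the `new` series built earlier, with 'Other' holding the sum of the five rare categories.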

Explanation

(series / series.sum() * 100)  # This gives each category's percentage of all rows, i.e.

Windows          39.820158
iOS              30.092211
Android          19.964276
Macintosh         8.853165
Chrome OS         0.529755
Linux             0.435101
Windows Phone     0.254954
(not set)         0.033587
BlackBerry        0.016793
Name: 1, dtype: float64

.lt(1) is equivalent to "less than 1". It gives you a boolean mask; index with that mask and assign the data.
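
For anyone who wants to try the steps end to end, here is a small self-contained sketch on made-up data. It assumes a 2% threshold purely for illustration, and it uses `value_counts(normalize=True)` and `Series.mask` in place of the manual division and `np.where`, which should behave the same way here.

```python
import pandas as pd

# Made-up sample: 60 Windows rows, 39 Mac rows, 1 Linux row
df = pd.DataFrame({"column": ["Windows"] * 60 + ["Mac"] * 39 + ["Linux"]})

# Percentage of rows taken by each category
pct = df["column"].value_counts(normalize=True) * 100

# Labels of the categories strictly below the (assumed) 2% threshold
rare = pct[pct.lt(2)].index

# Series.mask replaces the values where the condition is True
df["column"] = df["column"].mask(df["column"].isin(rare), "Other")

print(df["column"].value_counts())
```

`Series.mask` keeps the original index and dtype handling inside pandas, which can be convenient when the column is written back into the same data frame.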

Answer 2

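This is a (late) extension to your question; it applies the combining of low-frequency categories (those with a proportion less than min_freq) to the columns of a whole data frame. It is based on @Bharath's answer.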

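import numpy as np
import pandas as pd

def condense_category(col, min_freq=0.01, new_name='other'):
    series = col.value_counts(normalize=True)  # share of rows per distinct value
    mask = series.lt(min_freq)
    # Pass the original index so the result aligns with the input column
    return pd.Series(np.where(col.isin(series[mask].index), new_name, col), index=col.index)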

A simple example of its use:

df_toy = pd.DataFrame({'x': [1, 2, 3, 4] + [5]*100, 'y': [5, 6, 7, 8] + [0]*100})
df_toy = df_toy.apply(condense_category, axis=0)
print(df_toy)

#          x      y
# 0    other  other
# 1    other  other
# 2    other  other
# 3    other  other
# 4        5      0
# ..     ...    ...
# 99       5      0
# 100      5      0
# 101      5      0
# 102      5      0
# 103      5      0
# 
# [104 rows x 2 columns]
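
Applying the function to every column also condenses numeric or ID-like columns, which may not be what you want. As a rough sketch under that assumption, the same idea can be limited to a hand-picked list of columns; `condense_rare`, `df_demo`, `cat_cols` and the column names are hypothetical and not taken from the answers above.

```python
import pandas as pd

def condense_rare(col, min_freq=0.01, new_name="other"):
    """Replace values whose relative frequency is below min_freq."""
    freq = col.value_counts(normalize=True)
    rare = freq[freq.lt(min_freq)].index
    return col.mask(col.isin(rare), new_name)

# Made-up data: one categorical column and one ID column
df_demo = pd.DataFrame({
    "os": ["Linux"] + ["Windows"] * 120 + ["Mac"] * 79,
    "session_id": range(200),
})

# Only the listed columns are condensed; session_id is left untouched
cat_cols = ["os"]
df_demo[cat_cols] = df_demo[cat_cols].apply(condense_rare, min_freq=0.01)

print(df_demo["os"].value_counts())
```

Here 'Linux' makes up 1 row in 200 (0.5%), so it falls below the 1% cut-off and is relabelled, while every value in session_id is kept as is.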

Python: Combining low-frequency factors/category counts
