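How can I replace values in certain columns of a pandas.DataFrame that occur rarely, i.e. with low frequency (ignoring NaNs)?
For example, in the following DataFrame, suppose I want to replace any value in column A or column B that occurs fewer than three times in its respective column. I want to replace these with "other":
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['ant','ant','cherry', np.nan, 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog',np.nan, 'emu', 'emu']})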
df
A | B | C |
----------------------
ant | cat | dog |
ant | peach | dog |
cherry | cat | NaN |
NaN | cat | emu |
ant | peach | emu |
In other words, in columns A and B, I want to replace the values that occur twice or less (while leaving NaNs alone).
So the output I want is:
A | B | C |
----------------------
ant | cat | dog |
ant | other | dog |
other | cat | NaN |
NaN | cat | emu |
ant | other | emu |
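This is related to a previously posted question: Remove low frequency values from pandas.dataframe
But the solution there resulted in "AttributeError: 'NoneType' object has no attribute 'any'" (I suspect because I have NaN values?)
Answer 0 (score: 4)
This is very similar to Change values in pandas dataframe according to value_counts(). You can add a condition to the lambda function to exclude column 'C', as follows: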
df.apply(lambda x: x.mask(x.map(x.value_counts())<3, 'other') if x.name!='C' else x)
Out:
A B C
0 ant cat dog
1 ant other dog
2 other cat NaN
3 NaN cat emu
4 ant other emu
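This basically iterates over the columns. For each column, it generates the value counts and uses that Series for mapping. This allows x.mask to check whether the count is less than 3. If so, it returns 'other'; if not, it keeps the actual value. Finally, the condition checks the column name.
The condition in the lambda can be generalized to multiple columns by changing it from x.name!='C' to x.name not in 'CDEF' or x.name not in ['C', 'D', 'E', 'F']. The list form is the safer of the two, since the string form is a substring test and only behaves as intended for single-character column names.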
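To make the value_counts, map, and mask chain easier to follow, here is a minimal sketch that runs the same steps on column A alone. The intermediate names (`counts`, `per_row`, `is_rare`) are only illustrative and do not appear in the answer.

```python
import numpy as np
import pandas as pd

s = pd.Series(['ant', 'ant', 'cherry', np.nan, 'ant'], name='A')

counts = s.value_counts()          # NaN is excluded by default
per_row = s.map(counts)            # each row gets its value's count; NaN rows stay NaN
is_rare = per_row < 3              # NaN < 3 evaluates to False, so NaN rows are not flagged
result = s.mask(is_rare, 'other')  # replace only the flagged rows

print(result.tolist())
```

Because the comparison against NaN is False, the missing value in row 3 passes through unchanged, which is the behaviour the question asks for.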
Answer 1 (score: 3)
Using a helper function:
def replace_low_freq(df, threshold=2, replacement='other'):
    s = df.stack()                  # one long Series of all values in the selected columns
    c = s.value_counts()
    m = pd.Series(replacement, c.index[c <= threshold])  # maps each rare value to the replacement
    return s.replace(m).unstack()
cols = list('AB')
replace_low_freq(df[cols]).join(df.drop(columns=cols))
A B C
0 ant cat dog
1 ant other dog
2 other cat NaN
3 None cat emu
4 ant other emu
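One thing worth noting about the stack-based helper: stacking pools the selected columns before counting, so a value is judged by its combined frequency across A and B rather than per column. On the example frame the two readings agree because A and B share no values. The sketch below uses a small made-up frame (`demo` is hypothetical, not from the question) to show where they would differ.

```python
import pandas as pd

# Hypothetical frame in which 'x' appears in both columns
demo = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['x', 'y', 'z']})

pooled = demo.stack().value_counts()     # counts over A and B together
per_column_b = demo['B'].value_counts()  # counts within column B only

print(pooled['x'])
print(per_column_b['x'])
```

With a pooled count, 'x' in column B would survive a threshold of 2 even though it occurs only once in that column, so a per-column approach may be preferable when the columns share values.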
Answer 2 (score: 2)
You can use:
#added one last row for complicated df
df = pd.DataFrame({'A':['ant','ant','cherry', np.nan, 'ant', 'd'],
                   'B':['cat','peach', 'cat', 'cat', 'peach', 'm'],
                   'C':['dog','dog',np.nan, 'emu', 'emu', 'k']})
print (df)
A B C
0 ant cat dog
1 ant peach dog
2 cherry cat NaN
3 NaN cat emu
4 ant peach emu
5 d m k
Use value_counts and boolean indexing to find all values to replace:
a = df.A.value_counts()
a = a[a < 3].index
print (a)
Index(['cherry', 'd'], dtype='object')
b = df.B.value_counts()
b = b[b < 3].index
print (b)
Index(['peach', 'm'], dtype='object')
Then replace with a dict comprehension:
df.A = df.A.replace({x:'other' for x in a})
df.B = df.B.replace({x:'other' for x in b})
print (df)
A B C
0 ant cat dog
1 ant other dog
2 other cat NaN
3 NaN cat emu
4 ant other emu
5 other other k
If there are more columns to process, do it all together in a loop:
cols = ['A','B']
for col in cols:
    val = df[col].value_counts()
    y = val[val < 3].index
    df[col] = df[col].replace({x:'other' for x in y})
print (df)
A B C
0 ant cat dog
1 ant other dog
2 other cat NaN
3 NaN cat emu
4 ant other emu
5 other other k
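The loop above can be wrapped in a small reusable function. This is only a sketch: the name `replace_rare` and its parameters are hypothetical, and it uses `isin` with `where` in place of the dict-based `replace`, which should give the same result here while returning a copy instead of modifying df in place.

```python
import numpy as np
import pandas as pd

def replace_rare(df, cols, min_count=3, replacement='other'):
    """Return a copy of df in which values occurring fewer than min_count
    times within each listed column are replaced. NaN is left untouched."""
    out = df.copy()
    for col in cols:
        counts = out[col].value_counts()         # NaN is not counted
        rare = counts[counts < min_count].index  # values below the threshold
        # keep rows whose value is not rare; NaN is never in `rare`, so it is kept
        out[col] = out[col].where(~out[col].isin(rare), replacement)
    return out

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', np.nan, 'ant', 'd'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach', 'm'],
                   'C': ['dog', 'dog', np.nan, 'emu', 'emu', 'k']})
result = replace_rare(df, ['A', 'B'])
print(result)
```

Column C is not listed in `cols`, so it is returned unchanged.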