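Replace low-frequency categorical values in a pandas.DataFrame while ignoring NaN

Time: 2017-01-10 20:07:59

Tags: python-3.x pandas

How can I replace values in certain columns of a pandas.DataFrame that occur rarely, i.e. with low frequency (while ignoring NaNs)?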

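For example, in the following dataframe, suppose I want to replace any value in column A or column B that occurs fewer than three times in its own column. I want to replace these with "other":
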
import pandas as pd
import numpy as np

df = pd.DataFrame({'A':['ant','ant','cherry', np.nan, 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog',np.nan, 'emu', 'emu']})
df
   A   |   B   |  C  |
----------------------
ant    | cat   | dog |
ant    | peach | dog |
cherry | cat   | NaN |
NaN    | cat   | emu |
ant    | peach | emu |

In other words, in columns A and B, I want to replace the values that occur twice or less (while leaving NaNs alone).

So my desired output is:

   A   |   B   |  C  |
----------------------
ant    | cat   | dog |
ant    | other | dog |
other  | cat   | NaN |
NaN    | cat   | emu |
ant    | other | emu |

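This is related to a previously posted question: Remove low frequency values from pandas.dataframe

But the solution there resulted in an "AttributeError: 'NoneType' object has no attribute 'any'". (I think because I have NaN values?)

3 Answers:

Answer 0: (score: 4)

This is very similar to Change values in pandas dataframe according to value_counts(). You can add a condition to the lambda function to exclude column 'C', as follows: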

df.apply(lambda x: x.mask(x.map(x.value_counts())<3, 'other') if x.name!='C' else x)
Out: 
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu

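This basically iterates over the columns. For each column, it computes the value counts and uses that Series to map every value to its count. That lets x.mask check whether the count is less than 3. If it is, 'other' is returned; if not, the actual value is kept. NaNs map to a NaN count, which is not less than 3, so they are left alone. Finally, the condition checks the column name.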
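To make the per-column logic easier to follow, here is a minimal sketch that breaks the one-liner into its intermediate steps for column A only. The variable names (counts, per_row, is_rare) are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Column A from the question, including the NaN row
s = pd.Series(['ant', 'ant', 'cherry', np.nan, 'ant'], name='A')

counts = s.value_counts()          # NaN is excluded by default: ant -> 3, cherry -> 1
per_row = s.map(counts)            # look up each row's count; the NaN row maps to NaN
is_rare = per_row < 3              # comparing NaN with 3 gives False, so NaN is never flagged
result = s.mask(is_rare, 'other')  # replace only the flagged rows

print(result.tolist())
```

Because the comparison with NaN is False, no separate NaN handling is needed in this approach.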

The lambda's condition can be generalized to multiple columns by changing it from x.name!='C' to x.name not in 'CDEF' or x.name not in ['C', 'D', 'E', 'F'].
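As a sketch of the multi-column form, the list version is probably the safer choice: x.name not in 'CDEF' is a substring test, so it only behaves as intended for single-character column names (a column named 'CD' would also match). The skip list below is a hypothetical name introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', np.nan, 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', np.nan, 'emu', 'emu']})

skip = ['C']  # hypothetical: columns to leave untouched

# Columns listed in skip pass through unchanged; all others get rare values masked
out = df.apply(
    lambda x: x if x.name in skip
    else x.mask(x.map(x.value_counts()) < 3, 'other')
)
print(out)
```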

Answer 1: (score: 3)

Using a helper function:

def replace_low_freq(df, threshold=2, replacement='other'):
    s = df.stack()
    c = s.value_counts()
    m = pd.Series(replacement, c.index[c <= threshold])
    return s.replace(m).unstack()

cols = list('AB')
replace_low_freq(df[cols]).join(df.drop(cols, axis=1))

       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3   None    cat  emu
4    ant  other  emu
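One thing worth checking before relying on this helper: because it stacks the selected columns into a single Series, the counts are pooled across those columns rather than computed per column. In the question's data no value appears in both A and B, so the result is the same, but with shared values it can differ. A small sketch with made-up data (the toy frame below is an assumption for illustration only):

```python
import pandas as pd

# Made-up frame in which 'x' appears in both columns
toy = pd.DataFrame({'A': ['x', 'x', 'y'],
                    'B': ['x', 'z', 'z']})

pooled = toy.stack().value_counts()  # counts over A and B together
per_col_a = toy['A'].value_counts()  # counts within column A only

print(pooled['x'], per_col_a['x'])
```

Here 'x' reaches a count of 3 when pooled but only 2 within column A, so a per-column threshold of "fewer than three" would treat it differently.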


Answer 2: (score: 2)

You can use:

# added one extra row to make the df more complicated
df = pd.DataFrame({'A':['ant','ant','cherry', np.nan, 'ant', 'd'], 
                   'B':['cat','peach', 'cat', 'cat', 'peach', 'm'], 
                   'C':['dog','dog',np.nan, 'emu', 'emu', 'k']})
print (df)
        A      B    C
0     ant    cat  dog
1     ant  peach  dog
2  cherry    cat  NaN
3     NaN    cat  emu
4     ant  peach  emu
5       d      m    k

Use value_counts with boolean indexing to find all the values to replace:

a = df.A.value_counts()
a = a[a < 3].index
print (a)
Index(['cherry', 'd'], dtype='object')

b = df.B.value_counts()
b = b[b < 3].index
print (b)
Index(['peach', 'm'], dtype='object')

Then replace them using a dict comprehension:

df.A = df.A.replace({x:'other' for x in a})
df.B = df.B.replace({x:'other' for x in b})
print (df)
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu
5  other  other    k

If more columns need replacing, loop over them:

cols = ['A','B']
for col in cols:
    val = df[col].value_counts()
    y = val[val < 3].index
    df[col] = df[col].replace({x:'other' for x in y})
print (df)
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu
5  other  other    k
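The loop above can also be wrapped in a small reusable function that returns a new frame instead of modifying df in place. This is only a sketch: replace_rare, min_count and replacement are hypothetical names, and it uses isin with where rather than the dict-based replace shown above.

```python
import numpy as np
import pandas as pd

def replace_rare(df, cols, min_count=3, replacement='other'):
    """Return a copy of df in which values occurring fewer than
    min_count times in each listed column are replaced. NaNs are kept."""
    out = df.copy()
    for col in cols:
        counts = out[col].value_counts()          # NaN is not counted
        rare = counts[counts < min_count].index   # labels below the threshold
        # Keep rows whose value is not rare (NaN is never in rare)
        out[col] = out[col].where(~out[col].isin(rare), replacement)
    return out

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', np.nan, 'ant', 'd'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach', 'm'],
                   'C': ['dog', 'dog', np.nan, 'emu', 'emu', 'k']})

result = replace_rare(df, ['A', 'B'])
print(result)
```

Columns not listed in cols, such as C here, are returned unchanged.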
