Question

How can I replace values in certain columns of a pandas.DataFrame that occur rarely, i.e. with low frequency (ignoring NaN)?

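For example, in the following DataFrame, suppose I want to replace any value in column A or column B that occurs fewer than three times in its respective column. I want to replace these with "other":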

import pandas as pd
import numpy as np

df = pd.DataFrame({'A':['ant','ant','cherry', np.nan, 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog',np.nan, 'emu', 'emu']})
df
   A   |   B   |  C  |
----------------------
ant    | cat   | dog |
ant    | peach | dog |
cherry | cat   | NaN |
NaN    | cat   | emu |
ant    | peach | emu |

In other words, in columns A and B, I want to replace the values that occur twice or fewer (while leaving NaNs alone).

So my desired output is:

   A   |   B   |  C  |
----------------------
ant    | cat   | dog |
ant    | other | dog |
other  | cat   | NaN |
NaN    | cat   | emu |
ant    | other | emu |

This is related to a previously posted question: Remove low frequency values from pandas.dataframe

However, the solution there resulted in "AttributeError: 'NoneType' object has no attribute 'any'" (I assume because I have NaN values?)

Answer 1

This is very similar to Change values in pandas dataframe according to value_counts(). You can add a condition to the lambda function to exclude column 'C', as follows:

df.apply(lambda x: x.mask(x.map(x.value_counts())<3, 'other') if x.name!='C' else x)
Out: 
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu

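This essentially iterates over the columns. For each column, it generates the value counts and uses that Series for mapping. This lets x.mask check whether each value's count is less than 3. If it is, 'other' is returned; if not, the actual value is kept. Finally, the condition checks the column name.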
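To make those steps concrete, here is a small sketch that breaks the one-liner into its intermediate results for column A alone. The variable names (counts, per_row, result) are only illustrative and do not appear in the answer above.

```python
import numpy as np
import pandas as pd

s = pd.Series(['ant', 'ant', 'cherry', np.nan, 'ant'], name='A')

# Step 1: count each distinct value; NaN is excluded by default.
counts = s.value_counts()

# Step 2: look up each row's count; the NaN row has no entry, so it maps to NaN.
per_row = s.map(counts)

# Step 3: replace rows whose count is below 3. The comparison NaN < 3 is
# False, so the NaN row is left untouched.
result = s.mask(per_row < 3, 'other')
print(result.tolist())
```

This is why no special NaN handling is needed: missing values never satisfy the "count is below 3" condition.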

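The lambda's condition can be generalized to multiple columns by changing it from x.name!='C' to x.name not in 'CDEF' or x.name not in ['C', 'D', 'E', 'F']. The string form is a substring test, so it only suits single-character column names; the list form works for any names.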
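As a rough sketch of that generalization, the lambda can be wrapped in a small function that takes the columns to leave alone as a list. The function name mask_rare and its parameters (skip, min_count, replacement) are hypothetical, introduced here only for illustration.

```python
import numpy as np
import pandas as pd

def mask_rare(df, skip, min_count=3, replacement='other'):
    """Replace values seen fewer than min_count times, except in `skip` columns."""
    def per_column(x):
        # Columns named in `skip` are returned unchanged.
        if x.name in skip:
            return x
        return x.mask(x.map(x.value_counts()) < min_count, replacement)
    return df.apply(per_column)

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', np.nan, 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', np.nan, 'emu', 'emu']})

out = mask_rare(df, skip=['C'])
print(out)
```

Passing a list rather than a string avoids accidental substring matches when column names are longer than one character.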

Answer 2

Using a helper function:

def replace_low_freq(df, threshold=2, replacement='other'):
    # Stack the selected columns into one Series and count every value
    s = df.stack()
    c = s.value_counts()
    # Map each value occurring `threshold` times or fewer to the replacement
    m = pd.Series(replacement, c.index[c <= threshold])
    return s.replace(m).unstack()

cols = list('AB')
replace_low_freq(df[cols]).join(df.drop(columns=cols))

       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3   None    cat  emu
4    ant  other  emu
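One thing worth noting about this approach: because the selected columns are stacked into a single Series before counting, frequencies are pooled across columns rather than computed per column. In the question's data, A and B share no values, so the result is the same. The sketch below uses a made-up two-column frame (demo) to show where the two would differ; pd.concat stands in for stack here purely to keep the sketch simple.

```python
import pandas as pd

# Hypothetical data in which the same labels appear in both columns.
demo = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['x', 'y', 'y']})

# Per-column counts: 'x' appears twice in A and once in B.
per_col = {col: demo[col].value_counts().to_dict() for col in demo}

# Pooled counts, as obtained when counting over all stacked values.
pooled = pd.concat([demo[col] for col in demo]).value_counts().to_dict()

print(per_col)
print(pooled)
```

If per-column thresholds are what you need and columns can share labels, a per-column approach is probably the safer choice.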


Answer 3

You can use:

#added one last row for complicated df
df = pd.DataFrame({'A':['ant','ant','cherry', np.nan, 'ant', 'd'], 
                   'B':['cat','peach', 'cat', 'cat', 'peach', 'm'], 
                   'C':['dog','dog',np.nan, 'emu', 'emu', 'k']})
print (df)
        A      B    C
0     ant    cat  dog
1     ant  peach  dog
2  cherry    cat  NaN
3     NaN    cat  emu
4     ant  peach  emu
5       d      m    k

Use value_counts and boolean indexing to find all the values to replace:

a = df.A.value_counts()
a = a[a < 3].index
print (a)
Index(['cherry', 'd'], dtype='object')

b = df.B.value_counts()
b = b[b < 3].index
print (b)
Index(['peach', 'm'], dtype='object')

Then use replace with a dict comprehension:

df.A = df.A.replace({x:'other' for x in a})
df.B = df.B.replace({x:'other' for x in b})
print (df)
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu
5  other  other    k

If there are more columns to process, loop over them:

cols = ['A','B']
for col in cols:
    val = df[col].value_counts()
    y = val[val < 3].index
    df[col] = df[col].replace({x:'other' for x in y})
print (df)
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu
5  other  other    k
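The steps above can also be gathered into one reusable function. This is only a sketch: the name replace_rare and its parameters are hypothetical, and it swaps the dict-based replace for isin combined with mask, which does the same job without building a dictionary.

```python
import numpy as np
import pandas as pd

def replace_rare(df, cols, min_count=3, replacement='other'):
    """Return a copy of df in which, for each column in cols, values that
    occur fewer than min_count times are replaced. NaN is left as is."""
    out = df.copy()
    for col in cols:
        counts = out[col].value_counts()          # NaN is not counted
        rare = counts[counts < min_count].index   # labels below the threshold
        # isin() is False for NaN here, so missing values are never replaced.
        out[col] = out[col].mask(out[col].isin(rare), replacement)
    return out

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', np.nan, 'ant', 'd'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach', 'm'],
                   'C': ['dog', 'dog', np.nan, 'emu', 'emu', 'k']})

out = replace_rare(df, ['A', 'B'])
print(out)
```

Working on a copy leaves the original frame untouched, which makes it easier to compare before and after.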


Replace low-frequency categorical values in pandas.dataframe while ignoring NaN
