Question

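I want to create a function that takes a DataFrame and replaces NaN with the mode in categorical columns and with the column mean in numerical columns. If a categorical column has more than one mode, the first mode should be used.

I have managed to do it with the following code: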

def exercise4(df):
    df1 = df.select_dtypes(np.number)
    df2 = df.select_dtypes(exclude = 'float')
    mode = df2.mode()
    df3 = df1.fillna(df.mean())
    df4 = df2.fillna(mode.iloc[0,:])
    new_df = [df3,df4]
    df5 = pd.concat(new_df,axis=1)
    new_cols = list(df.columns)
    df6 = df5[new_cols]
    return df6

But I am sure there is a simpler way to do this?

Answer 1

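You can use:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdec'),
    'B': [4,5,4,5,5,4],
    'C': [7,8,9,4,2,3],
    'D': [1,3,5,7,1,0],
    'E': list('bbcdeb'),
})
# set rows 1 and 3 of columns B, C, A and E to NaN
df.iloc[[1,3], [1,2,0,4]] = np.nan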

print (df)
     A    B    C  D    E
0    a  4.0  7.0  1    b
1  NaN  NaN  NaN  3  NaN
2    c  4.0  9.0  5    c
3  NaN  NaN  NaN  7  NaN
4    e  5.0  2.0  1    e
5    c  4.0  3.0  0    b

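The idea is to use DataFrame.select_dtypes to pick the non-numeric columns, compute DataFrame.mode and select its first row by position with DataFrame.iloc. Then compute the means with numeric_only=True so that non-numeric columns are skipped (older pandas versions skipped them by default), join both Series with pd.concat (Series.append was removed in pandas 2.0) and pass all the replacement values to DataFrame.fillna:

modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
means = df.mean(numeric_only=True)
both = pd.concat([modes, means])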
print (both)
A          c
E          b
B       4.25
C       5.25
D    2.83333
dtype: object

df.fillna(both, inplace=True)
print (df)
   A     B     C  D  E
0  a  4.00  7.00  1  b
1  c  4.25  5.25  3  b
2  c  4.00  9.00  5  c
3  c  4.25  5.25  7  b
4  e  5.00  2.00  1  e
5  c  4.00  3.00  0  b
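
The question asks for the first mode when several values tie. A small sketch of why `.mode().iloc[0]` covers that case: `mode()` returns all tied values in sorted order, so the first row is the smallest of them (not necessarily the one that appears first in the data). The Series `s` below is made up for illustration.

```python
import pandas as pd

# 'a' and 'b' both appear twice, so there are two modes
s = pd.Series(['b', 'a', 'b', 'a', None])

modes = s.mode()            # missing values are ignored by default
first_mode = modes.iloc[0]  # first entry of the sorted modes

print(modes.tolist())
print(first_mode)
```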

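Wrapped in a function and called through DataFrame.pipe:

def exercise4(df):
    # first mode of every non-numeric column
    modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
    # mean of every numeric column
    means = df.mean(numeric_only=True)
    both = pd.concat([modes, means])
    # fills the NaN values of df itself
    df.fillna(both, inplace=True)
    return df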

df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)
   A     B     C  D  E
0  a  4.00  7.00  1  b
1  c  4.25  5.25  3  b
2  c  4.00  9.00  5  c
3  c  4.25  5.25  7  b
4  e  5.00  2.00  1  e
5  c  4.00  3.00  0  b
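
Because the function above fills with `inplace=True`, it also changes the DataFrame that the caller passed in. If that is not wanted, a variant that returns a filled copy could look like the sketch below. The name `fill_missing` and the `sample` data are hypothetical, and the sketch assumes the frame has at least one non-numeric column containing a non-missing value.

```python
import numpy as np
import pandas as pd

def fill_missing(df):
    """Return a filled copy; the caller's DataFrame is left unchanged."""
    modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
    means = df.mean(numeric_only=True)
    # without inplace=True, fillna returns a new DataFrame
    return df.fillna(pd.concat([modes, means]))

sample = pd.DataFrame({
    'A': ['a', np.nan, 'c', 'c'],
    'B': [4.0, np.nan, 4.0, 5.0],
})
filled = fill_missing(sample)
print(filled)
```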

Another idea is to use DataFrame.apply, but it needs the parameter result_type='expand', together with types.is_numeric_dtype to test the dtypes:

from pandas.api.types import is_numeric_dtype

f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
df.fillna(df.apply(f, result_type='expand'), inplace=True)
print (df)
   A     B     C  D  E
0  a  4.00  7.00  1  b
1  c  4.25  5.25  3  b
2  c  4.00  9.00  5  c
3  c  4.25  5.25  7  b
4  e  5.00  2.00  1  e
5  c  4.00  3.00  0  b
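
To see what `fillna` receives in this approach, it can help to look at the result of `apply` on its own: one fill value per column, indexed by column name. This is a sketch on made-up data; `sample` and `fill_value` are names chosen here, and `result_type` is left out because the function returns a scalar for each column.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

sample = pd.DataFrame({
    'A': ['a', np.nan, 'c', 'c'],
    'B': [4.0, np.nan, 4.0, 5.0],
})

def fill_value(col):
    # mean for numeric columns, first mode otherwise
    return col.mean() if is_numeric_dtype(col.dtype) else col.mode()[0]

fill_values = sample.apply(fill_value)  # Series indexed by column name
print(fill_values)

filled = sample.fillna(fill_values)     # each column gets its own value
print(filled)
```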

Wrapped in a function:

from pandas.api.types import is_numeric_dtype

def exercise4(df):
    f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
    df.fillna(df.apply(f, result_type='expand'), inplace=True)
    return df

df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)

Answer 2

Actually, you already have all the ingredients! Some of your steps can be chained together, and some become unnecessary.

For example, look at these two lines:

mode = df2.mode()
df4 = df2.fillna(mode.iloc[0,:])

You can replace them with df4 = df2.fillna(df2.mode().iloc[0,:]). Then, instead of repeatedly assigning new (sub-)DataFrames to variables, changing them and concatenating them, you can make these changes inplace, meaning they are applied directly to the DataFrame in question. Finally, the way your code selects columns by dtype may work in your specific (example) case, but what if the DataFrame contains more data types? A string column, perhaps?

My suggestion:
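
The following is only a sketch of the advice above, not the answerer's own code: it chains the mode call, writes the filled values straight back into the DataFrame instead of concatenating sub-frames, and splits the columns with `np.number` so that integer columns count as numeric. The `sample` data is made up, and the sketch assumes at least one non-numeric column with a non-missing value.

```python
import numpy as np
import pandas as pd

def exercise4(df):
    # column names split by dtype; np.number covers ints as well as floats
    num_cols = df.select_dtypes(include=np.number).columns
    cat_cols = df.select_dtypes(exclude=np.number).columns
    # assigning back to the column subsets updates df itself, so no concat is needed
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
    df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])
    return df

sample = pd.DataFrame({
    'A': ['a', np.nan, 'c', 'c'],
    'B': [4.0, np.nan, 4.0, 5.0],
    'D': [1, 3, 5, 7],
})
result = exercise4(sample)
print(result)
```

The column order of the original frame is preserved, so the final reordering step from the question is not needed.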

Answer 3

You can use the _get_numeric_data() method to get the numeric columns (and therefore, by elimination, the categorical ones):

numerical_col = df._get_numeric_data().columns

At this point you only need a single line of code that runs apply over the columns:

fixed_df = df.apply(lambda col: col.fillna(col.mean()) if col.name in numerical_col else col.fillna(col.mode()[0]), axis=0)
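
`_get_numeric_data()` starts with an underscore, which by Python convention marks it as private, so it may change without notice. A sketch of the same one-liner built on the public `select_dtypes` instead (a substitution made here, not part of the answer), using made-up `sample` data:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'A': ['a', np.nan, 'c', 'c'],
    'B': [4.0, np.nan, 4.0, 5.0],
})

# public alternative for finding the numeric column names
numerical_col = sample.select_dtypes(include='number').columns

fixed_df = sample.apply(
    lambda col: col.fillna(col.mean()) if col.name in numerical_col
    else col.fillna(col.mode()[0]),
    axis=0,
)
print(fixed_df)
```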

Answer 4

You can do it as follows:

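# category must be a quoted dtype name; plain string columns have dtype object, so both are checked
df = df.apply(lambda x: x.fillna(x.mode()[0]) if x.dtype.name in ('category', 'object') else x.fillna(x.mean()))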
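
A sketch of the same idea for a column that really is stored with the pandas category dtype. The check uses `isinstance(..., pd.CategoricalDtype)`, which is a choice made here rather than the answer's comparison, and the `sample` data is made up. Note that a column of plain strings is not categorical until it is converted with `astype('category')`.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'A': ['a', np.nan, 'c', 'c'],
    'B': [4.0, np.nan, 4.0, 5.0],
})
# convert the string column to a real categorical column
sample['A'] = sample['A'].astype('category')

filled = sample.apply(
    lambda x: x.fillna(x.mode()[0]) if isinstance(x.dtype, pd.CategoricalDtype)
    else x.fillna(x.mean())
)
print(filled)
```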

Replace NaN values in a DataFrame with pandas

4 answers: