Question

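I have a DataFrame in pandas, and I am trying to take data from other columns in the same row and use it to fill the NaN values in my data. How can I do this in pandas?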

For example:

       1     2   3     4     5   6   7     8     9    10  11    12    13  14    15    16
83  27.0  29.0 NaN  29.0  30.0 NaN NaN  15.0  16.0  17.0 NaN  28.0  30.0 NaN  28.0  18.0

The goal is to make the data look like this:

      1     2   3     4     5   6   7  ...    10  11    12    13  14    15    16
83  NaN  NaN NaN  27.0  29.0 29.0 30.0  ...  15.0 16.0  17.0  28.0 30.0  28.0  18.0

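The end goal is to take the mean of the last five columns that contain data. If fewer than five cells in a row are populated, take the mean of all the cells that are present.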

Answer 1

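Assuming you need to move all the NaN values to the leftmost columns, I would define a function that places all the NaN values first and leaves the remaining values in their original order: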

def fun(row):
    index_order = row.index[row.isnull()].append(row.index[~row.isnull()])
    row.iloc[:] = row[index_order].values
    return row

df_fix = df.loc[:,df.columns[1:]].apply(fun, axis=1)

If you need to overwrite the result in the same DataFrame:

df.loc[:,df.columns[1:]] = df_fix.copy()
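As a rough illustration of the same idea, here is a sketch of a variant that builds a new Series instead of writing into the row it receives. The name push_nans_left and the small sample frame are hypothetical and only serve the example; it assumes every column after the first holds floats.

```python
import numpy as np
import pandas as pd

def push_nans_left(row):
    # Hypothetical variant of fun(): returns a new Series, leaves `row` untouched.
    values = row.to_numpy(dtype=float)
    valid = values[~np.isnan(values)]                    # non-NaN values, original order
    padding = np.full(len(values) - len(valid), np.nan)  # one NaN per missing cell
    return pd.Series(np.concatenate([padding, valid]), index=row.index)

# Made-up sample: a label column followed by numeric columns.
df = pd.DataFrame(
    {"name": ["bob"], 1: [27.0], 2: [np.nan], 3: [29.0], 4: [np.nan], 5: [15.0]},
    index=[80],
)

# Skip the first (non-numeric) column, as in the answer above.
shifted = df.iloc[:, 1:].apply(push_nans_left, axis=1)
print(shifted)
```

The values 27.0, 29.0 and 15.0 end up in the last three columns, with the two NaN cells in front of them.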

Answer 2

Use the justify function for better performance, selecting all columns except the first with DataFrame.iloc:

print (df)
   name     1     2   3     4     5   6   7     8     9    10  11    12    13  \
80  bob  27.0  29.0 NaN  29.0  30.0 NaN NaN  15.0  16.0  17.0 NaN  28.0  30.0   

    14    15    16  
80 NaN  28.0  18.0  


df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan,  side='right')
print (df)
   name   1   2   3   4   5     6     7     8     9    10    11    12    13  \
80  bob NaN NaN NaN NaN NaN  27.0  29.0  29.0  30.0  15.0  16.0  17.0  28.0   

      14    15    16  
80  30.0  28.0  18.0
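Once the values are right-justified like this, the mean of the last five populated cells asked for in the question should reduce to a column slice. The following is only a sketch: the first row is copied from the printed output above, while the second row ("amy", with three values) is made up to show the case with fewer than five values.

```python
import numpy as np
import pandas as pd

# Row 80 is the right-justified row printed above; row 81 is a hypothetical short row.
full_row = [np.nan] * 5 + [27.0, 29.0, 29.0, 30.0, 15.0, 16.0,
                           17.0, 28.0, 30.0, 28.0, 18.0]
short_row = [np.nan] * 13 + [10.0, 20.0, 30.0]

df = pd.DataFrame([full_row, short_row], index=[80, 81], columns=range(1, 17))
df.insert(0, "name", ["bob", "amy"])

# mean() skips NaN by default, so a row with fewer than five values
# is averaged over the values that exist.
last5_mean = df.iloc[:, -5:].mean(axis=1)
print(last5_mean)
```

For row 80 this averages 17.0, 28.0, 30.0, 28.0 and 18.0; for the short row it averages only the three values present.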

The function:

import numpy as np

#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):    
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value treated as missing (pass np.nan for NaN); also used to fill the gaps
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    out = np.full(a.shape, invalid_val) 
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
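To see why sorting the boolean mask works, here is a small step-by-step sketch of the side='right', axis=1 case on a made-up 2x4 array. It relies on False sorting before True, so each row keeps its count of valid cells but moves them to the right.

```python
import numpy as np

# Made-up input: two rows with NaN gaps.
a = np.array([[1.0, np.nan, 2.0, np.nan],
              [np.nan, 3.0, np.nan, np.nan]])

mask = ~np.isnan(a)                     # True where a value is present
justified_mask = np.sort(mask, axis=1)  # False < True, so the Trues move right

out = np.full(a.shape, np.nan)          # start from an all-NaN array
# Boolean indexing walks both arrays in row-major order, and each row has the
# same number of True cells in both masks, so values stay in their own row.
out[justified_mask] = a[mask]
print(out)
```

Row 0 becomes NaN, NaN, 1.0, 2.0 and row 1 becomes NaN, NaN, NaN, 3.0. Flipping justified_mask along the same axis, as the function does for 'left' and 'up', gives the opposite justification.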

Performance:

#100 rows
df = pd.concat([df] * 100, ignore_index=True)

#41 times slower
In [39]: %timeit df.loc[:,df.columns[1:]] =  df.loc[:,df.columns[1:]].apply(fun, axis=1)
145 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [41]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan,  side='right')
3.54 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#1000 rows
df = pd.concat([df] * 1000, ignore_index=True)

#198 times slower
In [43]: %timeit df.loc[:,df.columns[1:]] =  df.loc[:,df.columns[1:]].apply(fun, axis=1)
1.13 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [45]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan,  side='right')
5.7 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
