A simple toy DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'mycol': ['foo', 'bar', 'hello', 'there', np.nan, np.nan, np.nan, 'foo'],
                   'mycol2': 'this is here to make it a DF'.split()})
print(df)
mycol mycol2
0 foo this
1 bar is
2 hello here
3 there to
4 NaN make
5 NaN it
6 NaN a
7 foo DF
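I am trying to fill the NaN values in a column (e.g. mycol) with samples drawn from that same column. I want the NaNs replaced with samples such as foo, bar, hello, and so on.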
# fill NA values with n samples (n= number of NAs) from df['mycol']
df['mycol'].fillna(df['mycol'].sample(n=df.isna().sum(), random_state=1,replace=True).values)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# fill NA values with n samples, n=1. Dropna from df['mycol'] before sampling:
df['mycol'] = df['mycol'].fillna(df['mycol'].dropna().sample(n=1, random_state=1,replace=True)).values
# nothing happens
Expected output: the NaNs are filled with random samples from mycol:
mycol mycol2
0 foo this
1 bar is
2 hello here
3 there to
4 foo make
5 foo it
6 hello a
7 foo DF
Edit: @Jezrael's answer below sorted it out; the problem was with my index.
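# sample len(df) values and reset the index so the filler lines up with
# df's default RangeIndex, then use it to fill only the NaN positions
df['mycol'] = df['mycol'].fillna(df['mycol']
                                 .dropna()
                                 .sample(n=len(df), replace=True)
                                 .reset_index(drop=True))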
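Answer 0 (score: 3)
Interesting question.
What works for me is setting the values with loc and converting the sampled values to a NumPy array to avoid index alignment: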
a = df['mycol'].dropna().sample(n=df['mycol'].isna().sum(), random_state=1,replace=True)
print (a)
3 there
7 foo
0 foo
Name: mycol, dtype: object
# pandas 0.24+
df.loc[df['mycol'].isna(), 'mycol'] = a.to_numpy()
# older pandas versions
# df.loc[df['mycol'].isna(), 'mycol'] = a.values
print (df)
mycol mycol2
0 foo this
1 bar is
2 hello here
3 there to
4 there make
5 foo it
6 foo a
7 foo DF
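The loc-based approach above can be wrapped in a small helper. This is only a sketch: the function name fill_na_with_samples is hypothetical, and it assumes the column has at least one non-NaN value to sample from. It counts the NaNs in the single column (a scalar, unlike df.isna().sum(), which returns one count per column) and assigns the sampled values by position.

```python
import numpy as np
import pandas as pd


def fill_na_with_samples(s, random_state=None):
    """Return a copy of s with each NaN replaced by a random non-NaN value of s.

    Hypothetical helper; assumes s contains at least one non-NaN value.
    """
    out = s.copy()
    mask = out.isna()
    n_missing = int(mask.sum())  # scalar NaN count for this one column
    if n_missing == 0:
        return out
    samples = out.dropna().sample(n=n_missing, replace=True,
                                  random_state=random_state)
    # to_numpy() discards the sampled index, so values are assigned by position
    out.loc[mask] = samples.to_numpy()
    return out


df = pd.DataFrame({'mycol': ['foo', 'bar', 'hello', 'there',
                             np.nan, np.nan, np.nan, 'foo'],
                   'mycol2': 'this is here to make it a DF'.split()})
filled = fill_na_with_samples(df['mycol'], random_state=1)
```

The helper returns a new Series and leaves df untouched; assign it back with df['mycol'] = filled if that is what you want.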
Your solution should work if the sampled Series has the same length and index as the original DataFrame:
s = df['mycol'].dropna().sample(n=len(df), random_state=1,replace=True)
s.index = df.index
print (s)
0 there
1 foo
2 foo
3 bar
4 there
5 foo
6 foo
7 bar
Name: mycol, dtype: object
df['mycol'] = df['mycol'].fillna(s)
print (df)
mycol mycol2
0 foo this
1 bar is
2 hello here
3 there to
4 there make
5 foo it
6 foo a
7 foo DF
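The reason the index matters is that fillna with a Series matches on index labels, not on position. A minimal sketch with made-up values (the variable names are illustrative only) shows a filler whose labels miss the NaN rows leaving them untouched, which would explain the "nothing happens" result in the question:

```python
import numpy as np
import pandas as pd

s = pd.Series(['foo', 'bar', np.nan, np.nan])

# Filler whose index labels (0, 1) do not cover the NaN positions (2, 3):
# nothing gets filled.
misaligned = pd.Series(['bar', 'foo'], index=[0, 1])
still_missing = s.fillna(misaligned)

# The same values relabelled to the NaN positions: both gaps are filled.
aligned = pd.Series(['bar', 'foo'], index=[2, 3])
filled = s.fillna(aligned)
```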
Answer 1 (score: 1)
You can forward fill or backward fill:
# backward fill
df['mycol'] = df['mycol'].bfill()
# forward fill
df['mycol'] = df['mycol'].ffill()
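As a quick illustration on the toy column from the question, here is what each direction produces. Note that both are deterministic: each NaN copies its nearest valid neighbour rather than a random sample.

```python
import numpy as np
import pandas as pd

s = pd.Series(['foo', 'bar', 'hello', 'there', np.nan, np.nan, np.nan, 'foo'],
              name='mycol')

forward = s.ffill()   # NaNs at 4-6 take the previous valid value, 'there'
backward = s.bfill()  # NaNs at 4-6 take the next valid value, 'foo'
```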