I have a DataFrame with some NaN values. For example, here is the df:
import numpy as np
import pandas as pd
np.random.seed(100)
data = np.random.rand(10,3)
data[3,0] = np.nan
data[6,0] = np.nan
data[5,1] = np.nan
data[7,1] = np.nan
data[1,2] = np.nan
data[8,2] = np.nan
data[6,2] = np.nan
df = pd.DataFrame(data)
df
This is the result of running the code above:
0 1 2
0 0.543405 0.278369 0.424518
1 0.844776 0.004719 NaN
2 0.670749 0.825853 0.136707
3 NaN 0.891322 0.209202
4 0.185328 0.108377 0.219697
5 0.978624 NaN 0.171941
6 NaN 0.274074 NaN
7 0.940030 NaN 0.336112
8 0.175410 0.372832 NaN
9 0.252426 0.795663 0.015255
What I want is for each NaN to be filled with the mean of the values directly above and below it, like this:
np.random.seed(100)
data = np.random.rand(10,3)
data[3,0] = (data[2,0] + data[4,0])/2
data[6,0] = (data[5,0] + data[7,0])/2
data[5,1] = (data[4,1] + data[6,1])/2
data[7,1] = (data[6,1] + data[8,1])/2
data[1,2] = (data[0,2] + data[2,2])/2
data[8,2] = (data[7,2] + data[9,2])/2
data[6,2] = (data[5,2] + data[7,2])/2
df = pd.DataFrame(data)
df
The code above produces:
0 1 2
0 0.543405 0.278369 0.424518
1 0.844776 0.004719 0.280612
2 0.670749 0.825853 0.136707
3 0.428039 0.891322 0.209202
4 0.185328 0.108377 0.219697
5 0.978624 0.191225 0.171941
6 0.959327 0.274074 0.254026
7 0.940030 0.323453 0.336112
8 0.175410 0.372832 0.175683
9 0.252426 0.795663 0.015255
How can I do this automatically in Python?
Answer 0 (score: 3)
I think DataFrame.interpolate should help here:
df1 = df.interpolate()
print (df1)
0 1 2
0 0.543405 0.278369 0.424518
1 0.844776 0.004719 0.280612
2 0.670749 0.825853 0.136707
3 0.428039 0.891322 0.209202
4 0.185328 0.108377 0.219697
5 0.978624 0.191225 0.171941
6 0.959327 0.274074 0.254026
7 0.940030 0.323453 0.336112
8 0.175410 0.372832 0.175683
9 0.252426 0.795663 0.015255
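As a quick check of the claim above, the following sketch rebuilds the hand-filled target from the question and compares it with the output of `interpolate()`. The names `nan_cells`, `expected`, and `holes` are introduced here for illustration only; the check relies on every NaN in this example being isolated, with a valid value directly above and below it.

```python
import numpy as np
import pandas as pd

np.random.seed(100)
data = np.random.rand(10, 3)

# Cells that the question sets to NaN (row, column).
nan_cells = [(3, 0), (6, 0), (5, 1), (7, 1), (1, 2), (8, 2), (6, 2)]

# Target: each cell replaced by the mean of the values directly above and below.
expected = data.copy()
for r, c in nan_cells:
    expected[r, c] = (data[r - 1, c] + data[r + 1, c]) / 2

# Input: the same data with NaN in those cells.
holes = data.copy()
for r, c in nan_cells:
    holes[r, c] = np.nan

filled = pd.DataFrame(holes).interpolate()
matches = np.allclose(filled.to_numpy(), expected)
print(matches)
```

For isolated gaps, linear interpolation over an evenly spaced index lands exactly halfway between the two neighbours, which is why the two results agree here.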
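If there are several consecutive NaNs, interpolate does not replace them with the mean; it fills them with evenly spaced values between the two surrounding valid values: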
np.random.seed(100)
data = np.random.rand(10,3)
data[3,0] = np.nan
data[6,0] = np.nan
data[5,1] = np.nan
data[7,1] = np.nan
data[1,2] = np.nan
data[2,2] = np.nan
data[8,2] = np.nan
data[6,2] = np.nan
df = pd.DataFrame(data)
print (df)
0 1 2
0 0.543405 0.278369 0.424518
1 0.844776 0.004719 NaN
2 0.670749 0.825853 NaN
3 NaN 0.891322 0.209202
4 0.185328 0.108377 0.219697
5 0.978624 NaN 0.171941
6 NaN 0.274074 NaN
7 0.940030 NaN 0.336112
8 0.175410 0.372832 NaN
9 0.252426 0.795663 0.015255
df1 = df.interpolate()
print (df1)
0 1 2
0 0.543405 0.278369 0.424518
1 0.844776 0.004719 0.352746
2 0.670749 0.825853 0.280974
3 0.428039 0.891322 0.209202
4 0.185328 0.108377 0.219697
5 0.978624 0.191225 0.171941
6 0.959327 0.274074 0.254026
7 0.940030 0.323453 0.336112
8 0.175410 0.372832 0.175683
9 0.252426 0.795663 0.015255
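The behaviour on a run of NaNs is easier to see on a small series. This is a minimal sketch with a made-up series `s` (not data from the question) that contains one isolated NaN and one run of two.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: an isolated NaN at position 1, a run of two at 3 and 4.
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 9.0])

linear = s.interpolate()
print(linear.tolist())  # [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]

# The isolated NaN gets the mean of its neighbours ...
isolated_is_mean = linear.iloc[1] == (s.iloc[0] + s.iloc[2]) / 2
# ... but the run is filled in equal steps from 3.0 to 9.0, not with their mean 6.0.
run_values = linear.iloc[3:5].tolist()
```

The run between 3.0 and 9.0 becomes 5.0 and 7.0, so neither value equals the mean of the two surrounding valid values.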
A solution that fills with the mean instead:
df2 = df.ffill().add(df.bfill()).div(2)
print (df2)
0 1 2
0 0.543405 0.278369 0.424518
1 0.844776 0.004719 0.316860
2 0.670749 0.825853 0.316860
3 0.428039 0.891322 0.209202
4 0.185328 0.108377 0.219697
5 0.978624 0.191225 0.171941
6 0.959327 0.274074 0.254026
7 0.940030 0.323453 0.336112
8 0.175410 0.372832 0.175683
9 0.252426 0.795663 0.015255
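To illustrate why the `ffill`/`bfill` average works, and what it does at the edges, here is a small sketch on a made-up series `s`. One assumption worth stating: a NaN at the very start or end has only one valid neighbour, so this approach leaves it as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with a leading NaN, a run of two, and a trailing NaN.
s = pd.Series([np.nan, 2.0, np.nan, np.nan, 8.0, np.nan])

forward = s.ffill()    # each NaN takes the last valid value above it
backward = s.bfill()   # each NaN takes the next valid value below it
mean_fill = (forward + backward) / 2

print(pd.DataFrame({"original": s, "ffill": forward,
                    "bfill": backward, "mean_fill": mean_fill}))
```

Both NaNs in the run receive 5.0, the mean of 2.0 and 8.0, while the first and last positions stay NaN because one of the two fills has nothing to copy from.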
Answer 1 (score: 3)
Use interpolation as you specified (filling only values one row away from a valid entry):
df.interpolate(method='index', limit=1)
Or do it directly with combine_first:
# ffill/bfill replace fillna(method=...), which is deprecated in recent pandas
fills = 0.5 * (df.ffill(limit=1) + df.bfill(limit=1))
df.combine_first(fills)
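The two variants in this answer treat runs of NaNs differently, which a small sketch can show. The series `s` below is made up for illustration; the calls are the same pandas methods used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: an isolated NaN at position 1, a run of two at 3 and 4.
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 9.0])

# Variant 1: interpolation limited to one consecutive NaN per gap.
interpolated = s.interpolate(method="index", limit=1)

# Variant 2: average of one-step forward and backward fills.
fills = 0.5 * (s.ffill(limit=1) + s.bfill(limit=1))
combined = s.combine_first(fills)

print(pd.DataFrame({"original": s, "interpolate": interpolated,
                    "combine_first": combined}))
```

In this sketch the isolated NaN becomes 2.0 in both variants. In the run, `interpolate(limit=1)` still fills the first NaN with an interpolated value and leaves the second, whereas the `combine_first` variant leaves both as NaN, because each position in the run is reached by only one of the two one-step fills.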
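Answer 2 (score: 0)
Alternatively, use sklearn, which fills each NaN with its column mean rather than the mean of its neighbours:
# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
mean_imputer = mean_imputer.fit(df)
imputed_df = mean_imputer.transform(df.values)
imputed_df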
array([[0.54340494, 0.27836939, 0.42451759],
       [0.84477613, 0.00471886, 0.21620453],
       [0.67074908, 0.82585276, 0.13670659],
       [0.5738436 , 0.89132195, 0.20920212],
       [0.18532822, 0.10837689, 0.21969749],
       [0.97862378, 0.44390102, 0.17194101],
       [0.5738436 , 0.27407375, 0.21620453],
       [0.94002982, 0.44390102, 0.33611195],
       [0.17541045, 0.37283205, 0.21620453],
       [0.25242635, 0.79566251, 0.01525497]])
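For comparison, the same column-mean fill can be sketched in pandas alone, without sklearn. This is an illustration, not part of the answer above: it rebuilds the question's DataFrame and uses `df.fillna(df.mean())`, which fills each column's NaNs with that column's mean.

```python
import numpy as np
import pandas as pd

# Rebuild the DataFrame from the question.
np.random.seed(100)
data = np.random.rand(10, 3)
for r, c in [(3, 0), (6, 0), (5, 1), (7, 1), (1, 2), (8, 2), (6, 2)]:
    data[r, c] = np.nan
df = pd.DataFrame(data)

# df.mean() skips NaN by default; fillna aligns the resulting Series with the columns.
column_means = df.mean()
filled = df.fillna(column_means)
print(filled)
```

Every NaN in a column receives the same value here, so rows 3 and 6 of column 0 end up identical. That differs from the neighbour-based fill the question asks for.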