How can I replace the 0s in the first row with values from the remaining rows?
import numpy as np
from sklearn.impute import SimpleImputer
data = np.array([[0, 0, 0, 0, 3, 2, 4, 4, 0],
                 [4, 6, 8, 9, 3, 1, 1, 4, 0],
                 [4, 6, 8, 9, 3, 1, 1, 4, 0]])
print(data.shape)
imputer = SimpleImputer(missing_values=0, strategy='mean')
res = imputer.fit_transform(data)
print(res)
[[4. 6. 8. 9. 3. 2. 4. 4.]
[4. 6. 8. 9. 3. 1. 1. 4.]
[4. 6. 8. 9. 3. 1. 1. 4.]]
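However, no column should be dropped (the last column, which contains only 0s, is missing from the output above).
The expected result is: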
[[4. 6. 8. 9. 3. 2. 4. 4. 0]
[4. 6. 8. 9. 3. 1. 1. 4. 0]
[4. 6. 8. 9. 3. 1. 1. 4. 0]]
Any ideas?
Answer 0 (score: 2)
Plain indexing is enough for what you need:
m = data[0] == 0
data[0, m] = data[1:,m].mean(0)
print(data)
array([[4, 6, 8, 9, 3, 2, 4, 4, 0],
[4, 6, 8, 9, 3, 1, 1, 4, 0],
[4, 6, 8, 9, 3, 1, 1, 4, 0]])
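One caveat with the in-place assignment above: data is an integer array, so a fractional mean would be truncated when written back. The following is a minimal sketch, not part of the original answer, that works on a float copy instead. The names filled, zero_in_first and ints are illustrative only.

```python
import numpy as np

data = np.array([[0, 0, 0, 0, 3, 2, 4, 4, 0],
                 [4, 6, 8, 9, 3, 1, 1, 4, 0],
                 [4, 6, 8, 9, 3, 1, 1, 4, 0]])

# Work on a float copy so fractional means survive and `data` stays untouched.
filled = data.astype(float)
zero_in_first = filled[0] == 0
# Column-wise mean of the remaining rows, only for columns where row 0 is 0.
filled[0, zero_in_first] = filled[1:, zero_in_first].mean(axis=0)
print(filled)

# Illustration of the truncation: the mean of 4 and 5 is 4.5,
# but assigning it into an integer array stores 4.
ints = np.array([[0, 2],
                 [4, 1],
                 [5, 1]])
mask = ints[0] == 0
ints[0, mask] = ints[1:, mask].mean(axis=0)
print(ints[0])
```

If the original array may be modified and the means are known to be whole numbers, the shorter in-place version from the answer is sufficient.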
To fill every zero with the mean of the other rows in its column, while excluding zeros from that mean, we can use a masked array:
m = data == 0
means = np.ma.array(data, mask = m).mean(0)
data + m * means.data
array([[4., 6., 8., 9., 3., 2., 4., 4., 0.],
[4., 6., 8., 9., 3., 1., 1., 4., 0.],
[4., 6., 8., 9., 3., 1., 1., 4., 0.]])
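A slightly more explicit variant of the same idea is sketched below. It assumes that a column consisting only of zeros should stay 0, and it makes that choice visible with filled(0) instead of relying on the raw data of a fully masked result. np.ma.masked_equal and np.where are swapped in here for illustration and are not what the answer itself used.

```python
import numpy as np

data = np.array([[0, 0, 0, 0, 3, 2, 4, 4, 0],
                 [4, 6, 8, 9, 3, 1, 1, 4, 0],
                 [4, 6, 8, 9, 3, 1, 1, 4, 0]])

# Mask every zero, then take the column-wise mean of the unmasked values.
masked = np.ma.masked_equal(data, 0)
col_means = masked.mean(axis=0)

# Columns that are entirely zero have no valid mean; fall back to 0 for them.
col_means = col_means.filled(0)

# Keep non-zero entries, replace zeros with the mean of their column.
result = np.where(data == 0, col_means, data)
print(result)
```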
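Update
To fill with the mean of the other columns instead (that is, the mean of the non-zero values in the same row), you can do the same thing along the other axis: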
m = data == 0
means = np.ma.array(data, mask = m).mean(1)
data + m * means.data[:,None]
array([[3.25, 3.25, 3.25, 3.25, 3. , 2. , 4. , 4. , 3.25],
[4. , 6. , 8. , 9. , 3. , 1. , 1. , 4. , 4.5 ],
[4. , 6. , 8. , 9. , 3. , 1. , 1. , 4. , 4.5 ]])
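As a quick check of the first row in the output above, the row mean can be worked out by hand: the non-zero entries are 3, 2, 4 and 4, so the fill value is (3 + 2 + 4 + 4) / 4 = 3.25. The sketch below reproduces that single row; the variable names are illustrative.

```python
import numpy as np

row = np.array([0, 0, 0, 0, 3, 2, 4, 4, 0])

# Mean over the non-zero entries only: (3 + 2 + 4 + 4) / 4
nonzero = row[row != 0]
row_mean = nonzero.mean()

# Replace each zero in the row with that mean.
filled_row = np.where(row == 0, row_mean, row)
print(row_mean)
print(filled_row)
```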
Answer 1 (score: 1)
One approach takes an axis parameter, so the fill can be applied along a generic axis -
def fill0s(data, axis):
    m = data!=0
    s = data.sum(axis, keepdims=True)
    c = m.sum(axis, keepdims=True)
    c[c==0] = 1 # to avoid warning of division by 0
    return np.where(m,data,s/c)
Sample run -
In [143]: data
Out[143]:
array([[0, 0, 0, 0, 3, 2, 4, 4, 0],
[4, 6, 8, 9, 3, 1, 1, 4, 0],
[6, 6, 8, 9, 3, 1, 1, 4, 0],
[0, 6, 8, 9, 3, 1, 1, 4, 0]])
In [144]: fill0s(data,axis=0)
Out[144]:
array([[5., 6., 8., 9., 3., 2., 4., 4., 0.],
[4., 6., 8., 9., 3., 1., 1., 4., 0.],
[6., 6., 8., 9., 3., 1., 1., 4., 0.],
[5., 6., 8., 9., 3., 1., 1., 4., 0.]])
In [147]: fill0s(data,axis=1)
Out[147]:
array([[3.25, 3.25, 3.25, 3.25, 3. , 2. , 4. , 4. , 3.25],
[4. , 6. , 8. , 9. , 3. , 1. , 1. , 4. , 4.5 ],
[6. , 6. , 8. , 9. , 3. , 1. , 1. , 4. , 4.75],
[4.57, 6. , 8. , 9. , 3. , 1. , 1. , 4. , 4.57]])
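The reason keepdims=True matters in fill0s is broadcasting: the sums and counts keep a length-1 dimension along the reduced axis, so s/c lines up with data for either axis. The following sketch, using the same sample array as the run above, prints those intermediate shapes. The variable names outside fill0s are illustrative.

```python
import numpy as np

def fill0s(data, axis):
    m = data != 0
    s = data.sum(axis, keepdims=True)   # zeros add nothing to the sum
    c = m.sum(axis, keepdims=True)      # number of non-zero entries
    c[c == 0] = 1                       # avoid division by 0 for all-zero lines
    return np.where(m, data, s / c)

data = np.array([[0, 0, 0, 0, 3, 2, 4, 4, 0],
                 [4, 6, 8, 9, 3, 1, 1, 4, 0],
                 [6, 6, 8, 9, 3, 1, 1, 4, 0],
                 [0, 6, 8, 9, 3, 1, 1, 4, 0]])

# Shapes of the reduced arrays: one value per column, or one value per row.
shape_axis0 = data.sum(0, keepdims=True).shape
shape_axis1 = data.sum(1, keepdims=True).shape
print(shape_axis0, shape_axis1)

by_columns = fill0s(data, axis=0)
by_rows = fill0s(data, axis=1)
print(by_columns[0, 0], by_rows[3, 0])
```

Without keepdims=True, the axis=1 case would produce a shape (4,) array, which does not broadcast against the (4, 9) input.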
Timings on a bigger dataset -
In [150]: np.random.seed(0)
In [151]: data = np.random.randint(0,10,(5000,5000))
In [152]: %timeit fill0s(data,axis=0)
161 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [153]: %timeit fill0s(data,axis=1)
155 ms ± 6.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#@yatu's solution
In [155]: %%timeit
...: m = data == 0
...: means = np.ma.array(data, mask = m).mean(0)
...: data + m * means.data
302 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [156]: %%timeit
...: m = data == 0
...: means = np.ma.array(data, mask = m).mean(1)
...: data + m * means.data[:,None]
291 ms ± 2.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
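For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. It first checks that both approaches agree and then times them. The array size, repeat count and the helper name fill0s_masked are arbitrary choices made for this sketch, and the numbers will differ from the ones above depending on machine and NumPy version.

```python
import timeit
import numpy as np

def fill0s(data, axis):
    m = data != 0
    s = data.sum(axis, keepdims=True)
    c = m.sum(axis, keepdims=True)
    c[c == 0] = 1                        # avoid division by 0
    return np.where(m, data, s / c)

def fill0s_masked(data, axis):
    # Masked-array approach, wrapped in a function for timing (hypothetical helper).
    m = data == 0
    means = np.ma.array(data, mask=m).mean(axis)
    means = np.expand_dims(np.ma.filled(means, 0), axis)
    return data + m * means

rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=(500, 500))

# Both approaches should give the same result along either axis.
same_axis0 = np.allclose(fill0s(data, 0), fill0s_masked(data, 0))
same_axis1 = np.allclose(fill0s(data, 1), fill0s_masked(data, 1))
print(same_axis0, same_axis1)

# Total wall-clock seconds for a few repetitions of each approach.
t_sum = timeit.timeit(lambda: fill0s(data, 0), number=3)
t_masked = timeit.timeit(lambda: fill0s_masked(data, 0), number=3)
print(f"sum/count: {t_sum:.4f}s  masked: {t_masked:.4f}s")
```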