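I need to replace each NaN with the value from the row above, except that NaNs in the first row should be replaced with zero. What is the most efficient way to do this?
Sample input and output: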
In [179]: arr
Out[179]:
array([[ 5., nan, nan, 7., 2., 6., 5.],
       [ 3., nan, 1., 8., nan, 5., nan],
       [ 4., 9., 6., nan, nan, nan, 7.]])
In [180]: out
Out[180]:
array([[ 5., 0., 0., 7., 2., 6., 5.],
       [ 3., 0., 1., 8., 2., 5., 5.],
       [ 4., 9., 6., 8., 2., 5., 7.]])
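Answer 0 (score: 2)
(Edited to include a (partially?) vectorized approach.)
(EDIT2: added some timings.)
The simplest solution that matches the required input/output is to loop over the rows: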
import numpy as np

def ffill_loop(arr, fill=0):
    # NaNs in the first row have no row above them, so they get `fill`.
    mask = np.isnan(arr[0])
    arr[0][mask] = fill
    # Each later row copies from the already-filled row above (in place).
    for i in range(1, len(arr)):
        mask = np.isnan(arr[i])
        arr[i][mask] = arr[i - 1][mask]
    return arr

print(ffill_loop(arr.copy()))
# [[5. 0. 0. 7. 2. 6. 5.]
#  [3. 0. 1. 8. 2. 5. 5.]
#  [4. 9. 6. 8. 2. 5. 7.]]
You can also use a vectorized approach, which may be faster for larger inputs (the fewer NaNs stacked directly below one another, the better):
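import numpy as np

def ffill_roll(arr, fill=0, axis=0):
    mask = np.isnan(arr)
    # Shift by one along `axis`, so each cell lines up with the value "above" it.
    replaces = np.roll(arr, 1, axis)
    # np.roll wraps around, so overwrite the wrapped-in first slice with `fill`.
    slicing = tuple(0 if i == axis else slice(None) for i in range(arr.ndim))
    replaces[slicing] = fill
    # Repeat until no NaN is left; each pass reaches one position further back.
    while np.count_nonzero(mask) > 0:
        arr[mask] = replaces[mask]
        mask = np.isnan(arr)
        replaces = np.roll(replaces, 1, axis)
    return arr

print(ffill_roll(arr.copy()))
# [[5. 0. 0. 7. 2. 6. 5.]
#  [3. 0. 1. 8. 2. 5. 5.]
#  [4. 9. 6. 8. 2. 5. 7.]]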
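To make the remark about NaNs stacked below one another concrete: each pass of the `while` loop in `ffill_roll()` can only pull values from one row further up, so the number of passes equals the length of the longest vertical run of NaNs. The sketch below counts that run. Note that `longest_nan_run` is a hypothetical helper introduced here for illustration; it is not part of the answer.

```python
import numpy as np

def longest_nan_run(arr, axis=0):
    """Hypothetical helper: longest run of consecutive NaNs along `axis`."""
    mask = np.isnan(np.moveaxis(arr, axis, 0))
    run = np.zeros(mask.shape[1:], dtype=int)  # current run length per lane
    longest = 0
    for row in mask:
        # Extend the run where this slice is NaN, reset it elsewhere.
        run = np.where(row, run + 1, 0)
        longest = max(longest, int(run.max()))
    return longest

sample = np.array([[5., np.nan, np.nan, 7., 2., 6., 5.],
                   [3., np.nan, 1., 8., np.nan, 5., np.nan],
                   [4., 9., 6., np.nan, np.nan, np.nan, 7.]])
print(longest_nan_run(sample))  # 2: e.g. column 1 has NaN in rows 0 and 1
```

For the sample array the result is 2, so `ffill_roll()` needs two passes there; inputs with long vertical NaN runs need correspondingly more.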
Timing these functions (including the loop-free solution proposed in @Divakar's answer):
import numpy as np
from numpy import nan

# Run this in IPython/Jupyter: %timeit is an IPython magic, not plain Python.
# ffill_cols is the function defined in Answer 1.
funcs = ffill_loop, ffill_roll, ffill_cols
sep = ' ' * 4
print(f'{"shape":15s}', end=sep)
for func in funcs:
    print(f'{func.__name__:>15s}', end=sep)
print()
for n in (1, 5, 10, 50, 100, 500, 1000, 2000):
    k = l = n
    arr = np.array([[ 5., nan, nan, 7., 2., 6., 5.] * k,
                    [ 3., nan, 1., 8., nan, 5., nan] * k,
                    [ 4., 9., 6., nan, nan, nan, 7.] * k] * l)
    print(f'{arr.shape!s:15s}', end=sep)
    for func in funcs:
        result = %timeit -q -o func(arr.copy())
        print(f'{result.best * 1e3:12.3f} ms', end=sep)
    print()
shape                   ffill_loop         ffill_roll         ffill_cols
(3, 7)                    0.009 ms           0.063 ms           0.026 ms
(15, 35)                  0.043 ms           0.074 ms           0.034 ms
(30, 70)                  0.092 ms           0.098 ms           0.055 ms
(150, 350)                0.783 ms           0.939 ms           0.786 ms
(300, 700)                2.409 ms           4.060 ms           3.829 ms
(1500, 3500)             49.447 ms         105.379 ms         169.649 ms
(3000, 7000)            169.799 ms         340.548 ms         759.854 ms
(6000, 14000)           656.982 ms        1369.651 ms        1610.094 ms
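For the given inputs, these timings show that ffill_loop() is actually the fastest in most cases; ffill_cols() is ahead only at (15, 35) and (30, 70). Conversely, ffill_cols() gradually becomes the slowest approach as the input size grows.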
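The benchmark above relies on the IPython `%timeit` magic. As a rough equivalent for a plain Python script, here is a minimal sketch using the standard-library `timeit` module. The helper `best_time_ms` and its `repeat`/`number` defaults are assumptions chosen for illustration, and absolute numbers will differ from machine to machine.

```python
import timeit
import numpy as np

def ffill_loop(arr, fill=0):
    # Same row-by-row fill as in the answer above (modifies arr in place).
    mask = np.isnan(arr[0])
    arr[0][mask] = fill
    for i in range(1, len(arr)):
        mask = np.isnan(arr[i])
        arr[i][mask] = arr[i - 1][mask]
    return arr

def best_time_ms(func, arr, repeat=5, number=10):
    """Hypothetical helper: best per-call time in milliseconds."""
    # Copy inside the timed call, as the original benchmark does,
    # so every run starts from an array that still contains NaNs.
    timings = timeit.repeat(lambda: func(arr.copy()), repeat=repeat, number=number)
    return min(timings) / number * 1e3

nan = np.nan
arr = np.array([[5., nan, nan, 7., 2., 6., 5.] * 10,
                [3., nan, 1., 8., nan, 5., nan] * 10,
                [4., 9., 6., nan, nan, nan, 7.] * 10] * 10)
print(arr.shape, f'{best_time_ms(ffill_loop, arr):.3f} ms')
```

As in the original timings, the cost of `arr.copy()` is included in every measurement, so it affects all candidates equally.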
Answer 1 (score: 2)
Here is a vectorized NumPy-based approach, inspired by the answer to "Most efficient way to forward-fill NaN values in numpy array":
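import numpy as np

def ffill_cols(a, startfillval=0):
    mask = np.isnan(a)
    tmp = a[0].copy()               # remember the original first row
    a[0][mask[0]] = startfillval    # temporarily fill first-row NaNs
    mask[0] = False
    # For each cell: its own row index if it holds a value, else 0.
    idx = np.where(~mask, np.arange(mask.shape[0])[:, None], 0)
    # Running maximum down each column = row index of the last valid value.
    out = np.take_along_axis(a, np.maximum.accumulate(idx, axis=0), axis=0)
    a[0] = tmp                      # restore the input array
    return out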
Sample run:
In [2]: a
Out[2]:
array([[ 5., nan, nan, 7., 2., 6., 5.],
       [ 3., nan, 1., 8., nan, 5., nan],
       [ 4., 9., 6., nan, nan, nan, 7.]])
In [3]: ffill_cols(a)
Out[3]:
array([[5., 0., 0., 7., 2., 6., 5.],
       [3., 0., 1., 8., 2., 5., 5.],
       [4., 9., 6., 8., 2., 5., 7.]])
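To see why this works, it can help to look at the intermediate index array. The sketch below repeats the steps of `ffill_cols()` on the sample input without modifying it. The names `rows`, `src`, and `filled_first` are introduced here for illustration only.

```python
import numpy as np

a = np.array([[5., np.nan, np.nan, 7., 2., 6., 5.],
              [3., np.nan, 1., 8., np.nan, 5., np.nan],
              [4., 9., 6., np.nan, np.nan, np.nan, 7.]])

# Work on a copy whose first-row NaNs are already replaced by the start value.
filled_first = a.copy()
filled_first[0][np.isnan(a[0])] = 0

mask = np.isnan(a)
mask[0] = False  # the first row is now always a valid source

rows = np.arange(a.shape[0])[:, None]
idx = np.where(~mask, rows, 0)            # own row index where valid, else 0
src = np.maximum.accumulate(idx, axis=0)  # row index of the last valid value so far
print(src)
# [[0 0 0 0 0 0 0]
#  [1 0 1 1 0 1 0]
#  [2 2 2 1 0 1 2]]

out = np.take_along_axis(filled_first, src, axis=0)
print(out)
# [[5. 0. 0. 7. 2. 6. 5.]
#  [3. 0. 1. 8. 2. 5. 5.]
#  [4. 9. 6. 8. 2. 5. 7.]]
```

Each entry of `src` names the row to copy from, so a single `np.take_along_axis` call performs the whole fill regardless of how many NaNs are stacked in a column.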
Answer 2 (score: 1)
import numpy as np
arr = np.array([[ 5., np.nan, np.nan, 7., 2., 6., 5.],
                [ 3., np.nan, 1., 8., np.nan, 5., np.nan],
                [ 4., 9., 6., np.nan, np.nan, np.nan, 7.]])
nan_indices = np.isnan(arr)
nan_indices tells you where the NaNs are:
array([[False,  True,  True, False, False, False, False],
       [False,  True, False, False,  True, False,  True],
       [False, False, False,  True,  True,  True, False]])
Now simply replace the values using the logic described in the question.
arr[0, nan_indices[0, :]] = 0
for row in range(1, np.shape(arr)[0]):
    arr[row, nan_indices[row, :]] = arr[row - 1, nan_indices[row, :]]
arr is now:
array([[5., 0., 0., 7., 2., 6., 5.],
       [3., 0., 1., 8., 2., 5., 5.],
       [4., 9., 6., 8., 2., 5., 7.]])
Answer 3 (score: 0)
How about this?
import numpy as np
x = np.array([[ 5., np.nan, np.nan, 7., 2., 6., 5.],
              [ 3., np.nan, 1., 8., np.nan, 5., np.nan],
              [ 4., 9., 6., np.nan, np.nan, np.nan, 7.]])
def fillnans(a):
    # Zero the first-row NaNs, then repeatedly copy from the row above.
    a[0, np.isnan(a[0, :])] = 0
    while np.any(np.isnan(a)):
        a[np.isnan(a)] = np.roll(a, 1, 0)[np.isnan(a)]
    return a
print(x)
print(fillnans(x))
[[ 5. nan nan  7.  2.  6.  5.]
 [ 3. nan  1.  8. nan  5. nan]
 [ 4.  9.  6. nan nan nan  7.]]
[[5. 0. 0. 7. 2. 6. 5.]
 [3. 0. 1. 8. 2. 5. 5.]
 [4. 9. 6. 8. 2. 5. 7.]]
I hope this helps!
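One detail worth spelling out: `np.roll` wraps around, so the first row of the shifted array is the original last row. That is why `fillnans()` zeroes the first row before entering the loop. A minimal sketch of the wrap-around, using a small made-up array:

```python
import numpy as np

x = np.array([[np.nan, 1.],
              [2., np.nan],
              [3., 4.]])

shifted = np.roll(x, 1, axis=0)
print(shifted[0])  # [3. 4.] -> the last row of x wrapped to the top

# Without zeroing the first row first, its NaN would be filled from the LAST row.
naive = x.copy()
nan_mask = np.isnan(naive)
naive[nan_mask] = shifted[nan_mask]
print(naive[0, 0])  # 3.0 rather than the required 0
```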
Answer 4 (score: 0)
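Replace the NaNs in the first row with zero:
import numpy as np
a = np.array([[5., np.nan, np.nan, 7., 2., 6., 5.],
              [3., np.nan, 1., 8., np.nan, 5., np.nan],
              [4., 9., 6., np.nan, np.nan, np.nan, 7.]])
where_are_NaNs = np.isnan(a[0])
a[0][where_are_NaNs] = 0
Then replace the NaNs in the other rows with the value from the row above.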
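The step for the remaining rows can be sketched as follows. This is a minimal sketch that assumes the same row-by-row copy used in the other answers; it is not taken from this answer, and the variable `row_nans` is introduced here for illustration.

```python
import numpy as np

a = np.array([[5., np.nan, np.nan, 7., 2., 6., 5.],
              [3., np.nan, 1., 8., np.nan, 5., np.nan],
              [4., 9., 6., np.nan, np.nan, np.nan, 7.]])

# First row: NaN -> 0
where_are_NaNs = np.isnan(a[0])
a[0][where_are_NaNs] = 0

# Other rows (assumed approach): NaN -> value from the already-filled row above
for i in range(1, a.shape[0]):
    row_nans = np.isnan(a[i])
    a[i][row_nans] = a[i - 1][row_nans]

print(a)
# [[5. 0. 0. 7. 2. 6. 5.]
#  [3. 0. 1. 8. 2. 5. 5.]
#  [4. 9. 6. 8. 2. 5. 7.]]
```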