I have a data frame that looks like this:
ID Name Surname Country Unique_number
1 John Snow UK 12345
1 John Anderson USA 53214
1 John David UK NA
2 Kim Snow UK 62321
2 Kim Anderson USA 77832
2 Kim David UK NA
I want the data to look like this (note the changes in Unique_number):
ID Name Surname Country Unique_number
1 John Snow UK 12345
1 John Anderson USA 53214
1 John David UK 12345
2 Kim Snow UK 62321
2 Kim Anderson USA 77832
2 Kim David UK 62321
Can someone help me do this in RStudio?
Thanks
Answer 0 (score: 0)
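What is the value of df$Unique_number[3]? Is it an empty string? If so, you can first convert it to NA:
df$Unique_number[df$Unique_number == ''] <- NA
and then use na.locf from the zoo package (na.rm = FALSE keeps a leading NA, so the result has the same length as the column):
df$Unique_number <- zoo::na.locf(df$Unique_number, na.rm = FALSE)
This carries the last non-NA observation forward to replace each NA.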
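For the data shown in the question, carrying values forward over the whole column would fill row 3 with 53214 (the value directly above it), whereas the desired output has 12345, the value from the same ID and Country. Below is a minimal base-R sketch of a grouped carry-forward. The helper name locf is hypothetical, and grouping by ID and Country is an assumption inferred from the desired output, not something the asker stated.

```r
# Data from the question (Unique_number is NA in rows 3 and 6)
df <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L),
  Name = c("John", "John", "John", "Kim", "Kim", "Kim"),
  Surname = c("Snow", "Anderson", "David", "Snow", "Anderson", "David"),
  Country = c("UK", "USA", "UK", "UK", "USA", "UK"),
  Unique_number = c(12345L, 53214L, NA, 62321L, 77832L, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last observation carried forward, base R only
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # position of the latest non-NA so far
  idx[idx == 0] <- NA                      # nothing to carry into leading NAs
  x[idx]
}

# Assumption: fill within each ID/Country group, as the desired output suggests
df$Unique_number <- ave(df$Unique_number, df$ID, df$Country, FUN = locf)
df$Unique_number
```

With this grouping, rows 3 and 6 pick up 12345 and 62321, which matches the desired table. The same grouping could be combined with zoo::na.locf in place of the helper.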
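Edit
To preserve the original NA values, split the data frame in two and operate only on the part that contains the values to be replaced (empty strings, I assume):
df0 = df[is.na(df$Unique_number), ]   # rows whose value is a real NA
df1 = df[!is.na(df$Unique_number), ]  # all other rows, including the empty strings
(or use split(df, is.na(df$Unique_number))), then run the code above on df1, and finally recombine the two parts with rbind.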
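The split, fill, and recombine steps can be sketched end to end as follows. This is only an illustration on made-up data: the columns id and value and their contents are hypothetical, and the carry-forward is written in base R so the snippet runs without zoo.

```r
# Hypothetical toy data: "" marks values to fill, NA must stay NA
df <- data.frame(
  id = 1:6,
  value = c("a", "", NA, "b", "", ""),
  stringsAsFactors = FALSE
)

# Split on the original NAs
is_na <- is.na(df$value)
df0 <- df[is_na, ]    # original NAs, left untouched
df1 <- df[!is_na, ]   # everything else, including the empty strings

# In df1 only: turn "" into NA, then carry the last non-NA value forward
df1$value[df1$value == ""] <- NA
idx <- cummax(seq_along(df1$value) * !is.na(df1$value))
idx[idx == 0] <- NA
df1$value <- df1$value[idx]

# Recombine and restore the original row order
out <- rbind(df0, df1)
out <- out[order(out$id), ]
out$value
```

Note that rbind stacks df0 on top of df1, so the rows need re-sorting (here by the hypothetical id column) to get back the original order.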
Edit 2
Here is another approach. I'm sure it will be slower than the zoo approach above, but it lets you specify your own string:
MISSING_STRING = '' # String you want replaced with last non-NA value
x0 <- c("1", "2", "", "3", "4", "", "", "5", "6", NA, "", "7", "8",
"", "9", "10", "") # Example vector
x <- x0 # Store initial example vector for comparison at the end
missing.ids <- which(is.na(x) | x == MISSING_STRING)
replacement.ids <- missing.ids - 1
replacement.ids[1 + which(diff(replacement.ids) == 1)] <- replacement.ids[diff(replacement.ids) == 1]
na.ids <- is.na(x)
x[missing.ids] <- x[replacement.ids]
x[na.ids] <- NA
# Compare initial vs final value
cbind(x0, x)
x0 x
[1,] "1" "1"
[2,] "2" "2"
[3,] "" "2"
[4,] "3" "3"
[5,] "4" "4"
[6,] "" "4"
[7,] "" "4"
[8,] "5" "5"
[9,] "6" "6"
[10,] NA NA
[11,] "" "6"
[12,] "7" "7"
[13,] "8" "8"
[14,] "" "8"
[15,] "9" "9"
[16,] "10" "10"
[17,] "" "10"
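The index adjustment in the code above looks only one position back, so it appears to cover runs of at most two consecutive missing entries, and it assumes the first element is not missing. A hedged sketch of a variant that handles runs of any length is given below. The function name fill_missing_string is hypothetical, and leaving leading blanks unchanged is a design choice of this sketch, not part of the original answer.

```r
# Hypothetical helper: replace each occurrence of `missing_string` with the
# most recent real value, leaving genuine NAs untouched
fill_missing_string <- function(x, missing_string = "") {
  is_blank  <- !is.na(x) & x == missing_string  # marker string only, not NA
  is_source <- !is.na(x) & !is_blank            # positions holding a usable value
  last_source <- cummax(seq_along(x) * is_source)
  last_source[last_source == 0] <- NA           # no usable value seen yet
  fillable <- is_blank & !is.na(last_source)    # leading blanks stay as they are
  out <- x
  out[fillable] <- x[last_source[fillable]]
  out
}

x0 <- c("1", "2", "", "3", "4", "", "", "5", "6", NA, "", "7", "8",
        "", "9", "10", "")
filled <- fill_missing_string(x0)
cbind(x0, filled)

# A run of three blanks, a leading blank, and an NA in between
fill_missing_string(c("", "1", "", "", "", NA, "", "2"))
```

On the example vector x0 this gives the same result as the table above; the second call shows the longer run being filled with "1" while the NA and the leading blank are kept.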
Answer 1 (score: 0)
Use fill from tidyr:
library(dplyr)
library(tidyr)
df %>%
group_by(Name, Country) %>%
fill(Unique_number)
Output:
# A tibble: 6 x 5
# Groups: Name, Country [4]
ID Name Surname Country Unique_number
<int> <fct> <fct> <fct> <int>
1 1 John Snow UK 12345
2 1 John David UK 12345
3 1 John Anderson USA 53214
4 2 Kim Snow UK 62321
5 2 Kim David UK 62321
6 2 Kim Anderson USA 77832
Data:
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), Name = structure(c(1L,
1L, 1L, 2L, 2L, 2L), .Label = c("John", "Kim"), class = "factor"),
Surname = structure(c(3L, 1L, 2L, 3L, 1L, 2L), .Label = c("Anderson",
"David", "Snow"), class = "factor"), Country = structure(c(1L,
2L, 1L, 1L, 2L, 1L), .Label = c("UK", "USA"), class = "factor"),
Unique_number = c(12345L, 53214L, NA, 62321L, 77832L, NA)), .Names = c("ID",
"Name", "Surname", "Country", "Unique_number"), class = "data.frame", row.names = c(NA,
-6L))