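I am a bit stuck on a specific variation of a more general problem. I have panel data that I am working with in data.table, and I would like to fill in some missing values using data.table's group-by functionality. Unfortunately the values are not numeric, so I cannot simply interpolate; they should be filled in only when a condition is met. Is it possible to perform a kind of conditional na.locf in data.table?
Essentially, I only want to fill in an NA if the next observation after the NA is the same as the observation before it, though the more general question is how to fill in NAs conditionally.
For example, in the following data I would like to fill in the associatedid variable within each id group. So id == 1, year == 2003 would be filled in with ABC123, because that is the value both before and after the NA, but year 2000 for the same id would not be filled. id == 2 would not be changed, because the next value differs from the value before the NA. id == 3 would be filled in for 2003 and 2004.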
df = read.table(header=T, text = "id year associatedid
1 2000 NA
1 2001 ABC123
1 2002 ABC123
1 2003 NA
1 2004 ABC123
1 2005 ABC123
2 2000 NA
2 2001 ABC123
2 2002 ABC123
2 2003 NA
2 2004 DEF456
2 2005 DEF456
3 2000 NA
3 2001 ABC123
3 2002 ABC123
3 2003 NA
3 2004 NA
3 2005 ABC123
")
dt = data.table(df,key = c("id"))
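Any suggestions or comments are much appreciated. Thanks!
Answer 0 (score: 7)
If na.locf0 applied forwards and na.locf0 applied backwards give the same value, use that value; otherwise, if they differ or either one is NA, use NA.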
library(data.table)
library(zoo)
dt[, associatedid :=
ifelse(na.locf0(associatedid) == na.locf0(associatedid, fromLast=TRUE),
na.locf0(associatedid), NA), by = id]
giving:
> dt
id year associatedid
1: 1 2000 <NA>
2: 1 2001 ABC123
3: 1 2002 ABC123
4: 1 2003 ABC123
5: 1 2004 ABC123
6: 1 2005 ABC123
7: 2 2000 <NA>
8: 2 2001 ABC123
9: 2 2002 ABC123
10: 2 2003 <NA>
11: 2 2004 DEF456
12: 2 2005 DEF456
13: 3 2000 <NA>
14: 3 2001 ABC123
15: 3 2002 ABC123
16: 3 2003 ABC123
17: 3 2004 ABC123
18: 3 2005 ABC123
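To make the comparison explicit, here is a minimal base-R sketch of the same rule on a single vector, without zoo. The helper names locf, nocb and fill_if_bracketed are made up for illustration, and the code assumes a plain character (or numeric) vector rather than a factor.

```r
# Hypothetical helpers, base R only.

# last observation carried forward; leading NAs stay NA
locf <- function(x) {
  # position of the most recent non-NA element at or before each index (0 if none)
  idx <- cummax(seq_along(x) * !is.na(x))
  # prepend an NA so that idx == 0 maps to NA
  c(NA, x)[idx + 1]
}

# next observation carried backward; trailing NAs stay NA
nocb <- function(x) rev(locf(rev(x)))

# fill an NA only when the values on both sides of the gap agree
fill_if_bracketed <- function(x) {
  fwd <- locf(x)
  bwd <- nocb(x)
  fill <- is.na(x) & !is.na(fwd) & !is.na(bwd) & fwd == bwd
  x[fill] <- fwd[fill]
  x
}

g1 <- c(NA, "ABC123", "ABC123", NA, "ABC123", "ABC123")
g2 <- c(NA, "ABC123", "ABC123", NA, "DEF456", "DEF456")
g3 <- c(NA, "ABC123", "ABC123", NA, NA, "ABC123")

fill_if_bracketed(g1)
fill_if_bracketed(g2)
fill_if_bracketed(g3)
```

Inside data.table this could be applied per group in the usual way, for example with `by = id`.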
Answer 1 (score: 4)
Here is a pure tidyverse solution:
library(tidyverse)
mydf %>%
mutate(up = associatedid, down = associatedid) %>%
group_by(id) %>%
fill(up,.direction = "up") %>%
fill(down) %>%
mutate_at("associatedid", ~if_else(is.na(.) & up == down, up, .)) %>%
ungroup() %>%
select(-up, - down)
#> # A tibble: 18 x 3
#> id year associatedid
#> <int> <int> <fct>
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
Or using zoo::na.locf:
library(dplyr)
library(zoo)
mydf %>%
group_by(id) %>%
mutate_at("associatedid", ~if_else(
is.na(.) & na.locf(.,F) == na.locf(.,F,fromLast = TRUE), na.locf(.,F), .)) %>%
ungroup()
#> # A tibble: 18 x 3
#> id year associatedid
#> <int> <int> <fct>
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
Same idea, but using data.table:
library(zoo)
library(data.table)
setDT(mydf)
mydf[,associatedid := fifelse(
is.na(associatedid) & na.locf(associatedid,F) == na.locf(associatedid,F,fromLast = TRUE),
na.locf(associatedid,F), associatedid),
by = id]
mydf
#> id year associatedid
#> 1: 1 2000 <NA>
#> 2: 1 2001 ABC123
#> 3: 1 2002 ABC123
#> 4: 1 2003 ABC123
#> 5: 1 2004 ABC123
#> 6: 1 2005 ABC123
#> 7: 2 2000 <NA>
#> 8: 2 2001 ABC123
#> 9: 2 2002 ABC123
#> 10: 2 2003 <NA>
#> 11: 2 2004 DEF456
#> 12: 2 2005 DEF456
#> 13: 3 2000 <NA>
#> 14: 3 2001 ABC123
#> 15: 3 2002 ABC123
#> 16: 3 2003 ABC123
#> 17: 3 2004 ABC123
#> 18: 3 2005 ABC123
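If you would rather not depend on zoo, a similar result can probably be had with data.table's own nafill, which handles numeric vectors, by filling integer codes instead of the strings. This is only a sketch: the data is rebuilt inline as character, and the column names code, fwd and bwd are temporary names I picked.

```r
library(data.table)

dt <- data.table(
  id = rep(1:3, each = 6),
  year = rep(2000:2005, times = 3),
  associatedid = c(NA, "ABC123", "ABC123", NA, "ABC123", "ABC123",
                   NA, "ABC123", "ABC123", NA, "DEF456", "DEF456",
                   NA, "ABC123", "ABC123", NA, NA, "ABC123")
)

# integer codes shared across all groups, so equal strings get equal codes
f <- factor(dt$associatedid)
lev <- levels(f)
dt[, code := as.integer(f)]

# fill the codes forwards and backwards within each id
dt[, c("fwd", "bwd") := .(nafill(code, type = "locf"),
                          nafill(code, type = "nocb")), by = id]

# replace an NA only where both directions found the same code
dt[is.na(associatedid) & !is.na(fwd) & !is.na(bwd) & fwd == bwd,
   associatedid := lev[fwd]]

# drop the temporary columns
dt[, c("code", "fwd", "bwd") := NULL]
dt
```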
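Finally, a fun idea using base R: note that if this character variable were numeric, you would want to interpolate only where constant interpolation and linear interpolation give the same result: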
i <- ave( as.numeric(factor(mydf$associatedid)), mydf$id,FUN = function(x) ifelse(
approx(x,xout = seq_along(x))$y == (z<- approx(x,xout = seq_along(x),method = "constant")$y),
z, x))
mydf$associatedid <- levels(mydf$associatedid)[i]
mydf
#> id year associatedid
#> 1 1 2000 <NA>
#> 2 1 2001 ABC123
#> 3 1 2002 ABC123
#> 4 1 2003 ABC123
#> 5 1 2004 ABC123
#> 6 1 2005 ABC123
#> 7 2 2000 <NA>
#> 8 2 2001 ABC123
#> 9 2 2002 ABC123
#> 10 2 2003 <NA>
#> 11 2 2004 DEF456
#> 12 2 2005 DEF456
#> 13 3 2000 <NA>
#> 14 3 2001 ABC123
#> 15 3 2002 ABC123
#> 16 3 2003 ABC123
#> 17 3 2004 ABC123
#> 18 3 2005 ABC123
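To see why the two interpolations agree only when the neighbours match, here is a small worked example on made-up integer codes (1 standing for "ABC123", 2 for "DEF456"). The function names are mine. Note that approx needs at least two non-NA values for linear interpolation, so this sketch assumes each group has them.

```r
# integer codes standing in for factor levels: 1 = "ABC123", 2 = "DEF456"
same_neighbours <- c(NA, 1, 1, NA, 1, 1)
diff_neighbours <- c(NA, 1, 1, NA, 2, 2)

# approx() drops the NA points and interpolates at every position;
# positions outside the observed range come back as NA (rule = 1)
linear   <- function(x) approx(x, xout = seq_along(x))$y
constant <- function(x) approx(x, xout = seq_along(x), method = "constant")$y

fill_where_equal <- function(x) {
  lin <- linear(x)
  con <- constant(x)
  # the two methods coincide inside a gap only when both ends carry the same code
  agree <- !is.na(lin) & !is.na(con) & lin == con
  x[agree] <- con[agree]
  x
}

linear(diff_neighbours)    # position 4 lands halfway between codes 1 and 2
constant(diff_neighbours)  # position 4 repeats the code to its left
fill_where_equal(same_neighbours)
fill_where_equal(diff_neighbours)
```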
Answer 2 (score: 3)
You can roll the missing rows forward and backward, compare the values, and assign them where they are equal.
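One way to sketch that idea, assuming data.table rolling joins are what is meant here. The data is rebuilt inline as character, and the names known, miss, fwd, bwd and agree are mine, so treat this as an illustration rather than a reference implementation.

```r
library(data.table)

dt <- data.table(
  id = rep(1:3, each = 6),
  year = rep(2000:2005, times = 3),
  associatedid = c(NA, "ABC123", "ABC123", NA, "ABC123", "ABC123",
                   NA, "ABC123", "ABC123", NA, "DEF456", "DEF456",
                   NA, "ABC123", "ABC123", NA, NA, "ABC123")
)

known <- dt[!is.na(associatedid)]             # rows with an observed value
miss  <- dt[is.na(associatedid), .(id, year)] # rows that need a decision

# for each missing row, look up the nearest known value within the same id:
# roll = TRUE carries the last observation forward,
# roll = -Inf carries the next observation backward
fwd <- known[miss, on = .(id, year), roll = TRUE]$associatedid
bwd <- known[miss, on = .(id, year), roll = -Inf]$associatedid

# assign only where both directions found a value and the values agree
agree <- !is.na(fwd) & !is.na(bwd) & fwd == bwd
dt[is.na(associatedid), associatedid := fifelse(agree, fwd, NA_character_)]
dt
```

The join results come back in the row order of miss, which is the same order as the NA rows of dt, so the assignment lines up.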
Answer 3 (score: 2)
This comes down to writing a modified na.locf function. After that, you can plug it into data.table like any other function.
new.locf <- function(x){
  # might want to think about the ends of this loop:
  # the first value has no predecessor, so it is never filled,
  # and vectors shorter than 3 are returned unchanged.
  if(length(x) < 3) return(x)
  # loop through the interior observations in a vector, x.
  for(i in 2:(length(x)-1)){
    nextval = i
    # find the next non-NA value; if there isn't one (trailing NAs),
    # nextval stops at the last position and x[nextval] is NA
    while(nextval <= length(x)-1 && is.na(x[nextval])){
      nextval = nextval + 1
    }
    # if the current value is not NA, great! leave it alone.
    if(is.na(x[i])){
      # if the current value is NA, and the last value is a value,
      # and the next value, as calculated above, is the same as the
      # last value, then give us that value. isTRUE() treats an NA
      # comparison as "no match", so the value simply stays NA.
      if(isTRUE(x[i-1] == x[nextval])){
        x[i] <- x[nextval]
      }
    }
  }
  # return the new vector
  return(x)
}
Once we have this function, we use data.table as usual:
dt2 <- dt[,list(year = year,
# when I read your data in, associatedid read as factor
associatedid = new.locf(as.character(associatedid))
),
by = "id"
]
which returns:
> dt2
id year associatedid
1: 1 2000 NA
2: 1 2001 ABC123
3: 1 2002 ABC123
4: 1 2003 ABC123
5: 1 2004 ABC123
6: 1 2005 ABC123
7: 2 2000 NA
8: 2 2001 ABC123
9: 2 2002 ABC123
10: 2 2003 NA
11: 2 2004 DEF456
12: 2 2005 DEF456
13: 3 2000 NA
14: 3 2001 ABC123
15: 3 2002 ABC123
16: 3 2003 ABC123
17: 3 2004 ABC123
18: 3 2005 ABC123
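As far as I understand the question, this is what you are after.
I left a few caveats in the comments of the new.locf definition, so you may still have some edge cases to think through, but this should get you started.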
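For the edge cases mentioned in the comments (leading or trailing NAs), a run-length version may be easier to reason about, because every NA run is handled as one unit. A sketch only: fill_gaps_rle is a name I made up, and it assumes a character or numeric vector, not a factor.

```r
# Hypothetical alternative to the loop above, base R only.
fill_gaps_rle <- function(x) {
  r <- rle(is.na(x))                 # alternating runs of NA / non-NA
  ends <- cumsum(r$lengths)          # last index of each run
  starts <- ends - r$lengths + 1     # first index of each run
  for (k in which(r$values)) {       # visit only the NA runs
    before <- starts[k] - 1          # index just before the gap
    after <- ends[k] + 1             # index just after the gap
    # a gap at either end of the vector has only one neighbour, so skip it;
    # otherwise both neighbours are non-NA by construction
    if (before >= 1 && after <= length(x) && x[before] == x[after]) {
      x[starts[k]:ends[k]] <- x[before]
    }
  }
  x
}

fill_gaps_rle(c(NA, "ABC123", "ABC123", NA, NA, "ABC123"))
fill_gaps_rle(c("ABC123", NA, "DEF456", NA))
```

It can be dropped into the data.table call in the same way as new.locf, again after converting the column with as.character.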
Answer 4 (score: 1)
Here is another attempt with dplyr:
library(dplyr)
mydf %>%
#Detect NA values in associatedid
mutate(isReplaced = is.na(associatedid), ans = associatedid) %>%
group_by(id) %>%
#Fill all NA values
tidyr::fill(associatedid) %>%
#Detect the NA values which were replaced
mutate(isReplaced = isReplaced & !is.na(associatedid)) %>%
#Group by id and associatedid
group_by(associatedid, add = TRUE) %>%
#Add NA values if it was isReplaced and is first or last row of the group
mutate(ans = replace(associatedid,row_number() %in% c(1, n()) & isReplaced, NA)) %>%
ungroup() %>%
select(-isReplaced, -associatedid)
# A tibble: 18 x 3
# id year ans
# <int> <int> <fct>
# 1 1 2000 NA
# 2 1 2001 ABC123
# 3 1 2002 ABC123
# 4 1 2003 ABC123
# 5 1 2004 ABC123
# 6 1 2005 ABC123
# 7 2 2000 NA
# 8 2 2001 ABC123
# 9 2 2002 ABC123
#10 2 2003 NA
#11 2 2004 DEF456
#12 2 2005 DEF456
#13 3 2000 NA
#14 3 2001 ABC123
#15 3 2002 ABC123
#16 3 2003 ABC123
#17 3 2004 ABC123
#18 3 2005 ABC123
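Answer 5 (score: 0)
I have been trying to put together a two-pass approach: on the first pass, change each NA to the preceding value (within id) with "p_" pasted in front, then use a second pass to check that the last tentative value in a sequence agrees with the next real value. I am posting my code so far, which is not really an answer, so I am not expecting any upvotes. (It would probably be easier to rename associatedid to asid.)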
lapply( split(df, df$id),
        function(d){ d$associatedid <- as.character(d$associatedid)
          # positions of the NAs in this id; a leading NA has no previous value, so skip it
          missloc <- setdiff(which(is.na(d$associatedid)), 1)
          for (n in missloc) {
            prev <- d$associatedid[n-1]
            # TRUE only when the next value equals the previous one once any "p_" prefix is removed
            if( isTRUE( gsub("^p_", "", prev) == d$associatedid[n+1] ) )
            { d$associatedid[n] <- prev
            } else{
              #tentative NA replacement
              d$associatedid[n] <- paste0("p_" , prev)}
          }
          # first pass only: the "p_" values still need a second pass to be confirmed or reset to NA
          d
        })
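For what it is worth, here is how the complete two-pass idea might look on a single vector. This is a sketch under my own assumptions: two_pass_fill and strip_prefix are hypothetical names, and real values are assumed never to start with "p_".

```r
# Hypothetical completion of the two-pass idea, base R only.
two_pass_fill <- function(x, prefix = "p_") {
  x <- as.character(x)
  strip_prefix <- function(v) {
    if (startsWith(v, prefix)) substring(v, nchar(prefix) + 1) else v
  }
  # pass 1 (forwards): tag each NA with the previous value as a tentative fill
  for (n in seq_along(x)[-1]) {
    if (is.na(x[n]) && !is.na(x[n - 1])) {
      x[n] <- paste0(prefix, strip_prefix(x[n - 1]))
    }
  }
  # pass 2 (backwards): confirm a tentative fill only if the value after it,
  # already resolved because we walk backwards, is the same real value
  for (n in rev(seq_along(x))) {
    if (!is.na(x[n]) && startsWith(x[n], prefix)) {
      candidate <- strip_prefix(x[n])
      nxt <- if (n < length(x)) x[n + 1] else NA
      x[n] <- if (!is.na(nxt) && nxt == candidate) candidate else NA
    }
  }
  x
}

two_pass_fill(c(NA, "ABC123", "ABC123", NA, NA, "ABC123"))
two_pass_fill(c(NA, "ABC123", "ABC123", NA, "DEF456", "DEF456"))
```

Walking backwards in the second pass means a run of several tentative values is resolved from its far end, so one mismatch at the end of a gap resets the whole gap to NA.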