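I have a large dataset that needs recoding. Each row of the dataset is a possible detection from a separate experiment (id), ordered by time (time). Each possible detection is then verified manually. The first true detection is marked (in Comments) as "first", and the last true detection is marked as "last". If there were no detections, "none" is entered.
I am doing the recoding with if statements. 1) First, for every id where both first and last exist, I want to fill everything between first and last with 'no_comment', and then fill everything before first or after last with "MVND". 2) For every id where only "none" is present, fill all rows of that id with "none". The individual lines of code work, but for some reason they do not work together when I combine them in an if statement inside ddply: they just return the original data.frame. I think my if else structure is wrong.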
#approximate data structure for this case:
y <-data.frame(id=c(rep("a",10),rep("b",10),rep("c",10)),time=rep(1:10, 3), Comments=rep(NA,30))
y$Comments[c(2,11,23)]<-"first"
y$Comments[c(9,19,30)]<-"last"
#x=y[y$id=="a",] #testing specific lines
#recursive process to step through the data
ddply(y,.(id), .fUN=function(x){
if(all(unique(na.omit(x$Comments))%in%c("first","last"))){
f<-which(x$Comments == "first")
l<-which(x$Comments == "last")
#Add no comment to all records between first and last
x$Comments[(f+1):(l - 1)]<- "no_comment"
#if 'first' isn't the first record add MVND to all things before 'first'
if(f>1){x$Comments[1:(f-1)]<-"MVND"}
#if 'last' isn't the last record add MVND to all records after 'last'.
if(l<nrow[x]){x$Comments[(l+1):nrow(x)]<-"MVND"}
}else if(unique(na.omit(x$Comments))=="none"){
x$Comments<-"none" #if the only unique comment is "none" set all comments to none
}
}
)
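If data.table is a better way to do this, I would love to find out how to do it in dt.
#Edit: The above has been modified to expand on the two cases I am dealing with, "first/last" and "none". Jon Spring's solution works well for the way I originally posted the sample data, which contained only the first/last case.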
Answer 0 (score: 1)
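Not sure whether this is useful to you, but here is how I would handle it in dplyr. Since this is vectorized, I expect it to run faster than a loop-based approach.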
library(dplyr)
y %>%
group_by(id) %>%
dplyr::mutate(Comments2 = case_when( # in case `plyr` is loaded
cumsum(coalesce(lag(Comments == "last"), FALSE)) >= 1 ~ "MVND",
cumsum(coalesce(Comments == "first", FALSE)) < 1 ~ "MVND",
is.na(Comments) ~ "no_comment",
TRUE ~ Comments)) %>%
ungroup()
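The tricky part here is the MVND bookends, where I count whether we have already passed last or have not yet reached first. coalesce replaces any NA in its first argument with the value given as its second argument, here FALSE. cumsum then adds up the TRUE values.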
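To make the bookend logic concrete, here is a minimal base R sketch on the Comments of a single group. It is an illustration only: the vector `comments` and the names `past_last` and `before_first` are made up for this example, and plain logical tests plus a manual shift stand in for dplyr's coalesce and lag.

```r
# Hypothetical Comments vector for one id (not taken from the data above)
comments <- c(NA, "first", NA, NA, "last", NA)

# TRUE where the comment equals the target; NA is treated as FALSE
is_first <- !is.na(comments) & comments == "first"
is_last  <- !is.na(comments) & comments == "last"

# Shift is_last down by one row, like lag(): a row only counts as
# "past last" if an earlier row was "last"
lagged_last <- c(FALSE, head(is_last, -1))

past_last    <- cumsum(lagged_last) >= 1  # rows after "last"
before_first <- cumsum(is_first) < 1      # rows before "first"

# Bookends get "MVND", remaining NAs get "no_comment", the rest is kept
recoded <- ifelse(past_last | before_first, "MVND",
                  ifelse(is.na(comments), "no_comment", comments))
print(recoded)
```

Because the running sums never decrease, each flag flips once per group, which is what lets the whole recoding be done without a loop.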
Here is the result I get, pasted as a tribble using datapasta. As far as I can tell, the output looks as expected:
tibble::tribble(
~id, ~time, ~Comments, ~Comments2,
"a", 1L, NA, "MVND",
"a", 2L, "first", "first",
"a", 3L, NA, "no_comment",
"a", 4L, NA, "no_comment",
"a", 5L, NA, "no_comment",
"a", 6L, NA, "no_comment",
"a", 7L, NA, "no_comment",
"a", 8L, NA, "no_comment",
"a", 9L, "last", "last",
"a", 10L, NA, "MVND",
"b", 1L, "first", "first",
"b", 2L, NA, "no_comment",
"b", 3L, NA, "no_comment",
"b", 4L, NA, "no_comment",
"b", 5L, NA, "no_comment",
"b", 6L, NA, "no_comment",
"b", 7L, NA, "no_comment",
"b", 8L, NA, "no_comment",
"b", 9L, "last", "last",
"b", 10L, NA, "MVND",
"c", 1L, NA, "MVND",
"c", 2L, NA, "MVND",
"c", 3L, "first", "first",
"c", 4L, NA, "no_comment",
"c", 5L, NA, "no_comment",
"c", 6L, NA, "no_comment",
"c", 7L, NA, "no_comment",
"c", 8L, NA, "no_comment",
"c", 9L, NA, "no_comment",
"c", 10L, "last", "last"
)
Answer 1 (score: 1)
For this task, my preferred approach is data.table, for two reasons:
To cover all the use cases mentioned by the OP, we need to create an enhanced sample dataset:
y <- data.frame(
id = rep(letters[1:5], each = 5L),
time = rep(1:5, 5L),
Comments = rep(NA_character_, 25L))
y$Comments[c(2, 6, 13, 22)] <- "first"
y$Comments[c(4, 9, 15, 23)] <- "last"
y$Comments[c(18)] <- "none"
y
id time Comments
1 a 1 <NA>
2 a 2 first
3 a 3 <NA>
4 a 4 last
5 a 5 <NA>
6 b 1 first
7 b 2 <NA>
8 b 3 <NA>
9 b 4 last
10 b 5 <NA>
11 c 1 <NA>
12 c 2 <NA>
13 c 3 first
14 c 4 <NA>
15 c 5 last
16 d 1 <NA>
17 d 2 <NA>
18 d 3 none
19 d 4 <NA>
20 d 5 <NA>
21 e 1 <NA>
22 e 2 first
23 e 3 last
24 e 4 <NA>
25 e 5 <NA>
Now, we can fill in the missing Comments:
library(data.table)
y <- setDT(copy(y))
# copy "none" to all rows of the id group in case one Comment is "none"
y[, Comments := if (isTRUE(any(Comments == "none"))) "none" , by = id][]
# create look-up table
lut <- dcast(y[which(Comments %in% c("first", "last"))], id ~ Comments, value.var = "time")
# update in non-equi joins
y[lut, on = .(id, time < first), Comments := "MVND"][]
y[lut, on = .(id, time > last), Comments := "MVND"][]
y[lut, on = .(id, time > first, time < last), Comments := "no commments"][]
id time Comments
1: a 1 MVND
2: a 2 first
3: a 3 no commments
4: a 4 last
5: a 5 MVND
6: b 1 first
7: b 2 no commments
8: b 3 no commments
9: b 4 last
10: b 5 MVND
11: c 1 MVND
12: c 2 MVND
13: c 3 first
14: c 4 no commments
15: c 5 last
16: d 1 none
17: d 2 none
18: d 3 none
19: d 4 none
20: d 5 none
21: e 1 MVND
22: e 2 first
23: e 3 last
24: e 4 MVND
25: e 5 MVND
id time Comments
The look-up table lut contains, for each id, the time at which first and last occur, respectively:
id first last
1: a 2 4
2: b 1 4
3: c 3 5
4: e 2 3
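As a rough illustration of what the look-up table and the updates do, the same logic can be written in base R. This is a sketch, not the data.table code above: `merge` and `match` replace the non-equi joins, the data frames `dat` and `lut_df` are made up for the example, and the label "no_comment" follows the spelling used in the question.

```r
# Hypothetical mini data set: id "a" has a first/last pair, id "d" has only "none"
dat <- data.frame(
  id = rep(c("a", "d"), each = 4L),
  time = rep(1:4, 2L),
  Comments = c(NA, "first", NA, "last", NA, "none", NA, NA),
  stringsAsFactors = FALSE
)

# Copy "none" to all rows of every id that contains a "none"
has_none <- dat$id %in% dat$id[which(dat$Comments == "none")]
dat$Comments[has_none] <- "none"

# Look-up table: one row per id with the time of "first" and of "last"
firsts <- dat[which(dat$Comments == "first"), c("id", "time")]
lasts  <- dat[which(dat$Comments == "last"),  c("id", "time")]
lut_df <- merge(firsts, lasts, by = "id", suffixes = c("_first", "_last"))

# Attach first/last time to every row; ids missing from lut_df get NA
idx     <- match(dat$id, lut_df$id)
t_first <- lut_df$time_first[idx]
t_last  <- lut_df$time_last[idx]

# Same conditions as the joins; which() drops the NA comparisons
dat$Comments[which(dat$time < t_first | dat$time > t_last)] <- "MVND"
dat$Comments[which(dat$time > t_first & dat$time < t_last)] <- "no_comment"
print(dat)
```

Ids without a first/last pair, such as "d" here, have no row in the look-up table, so the last two assignments leave them untouched.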
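Note that we assume the production dataset is "well behaved", i.e., 1) every id group contains either "none" or exactly one pair of "first" and "last" in the Comments column, and 2) "first" always appears before "last".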
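Before recoding a production dataset, it may be worth checking these assumptions per id. The following base R sketch is one possible way to do that. The function name `is_well_behaved` and the example data are made up for illustration, and the rule that a "none" group must not also contain "first" or "last" is my reading of the assumption, not something stated explicitly.

```r
# Returns TRUE if the comments of one id group satisfy the assumptions.
# `comments` must already be ordered by time. Hypothetical helper.
is_well_behaved <- function(comments) {
  first_pos <- which(comments == "first")
  last_pos  <- which(comments == "last")
  has_none  <- any(comments == "none", na.rm = TRUE)
  if (has_none) {
    # assumed: a "none" group contains neither "first" nor "last"
    return(length(first_pos) == 0L && length(last_pos) == 0L)
  }
  # otherwise: exactly one pair, with "first" before "last"
  length(first_pos) == 1L && length(last_pos) == 1L && first_pos < last_pos
}

# Hypothetical example: two valid groups and two invalid ones
dat <- data.frame(
  id = rep(c("a", "b", "c", "d"), each = 3L),
  Comments = c("first", NA, "last",   # valid pair
               NA, "none", NA,        # valid "none" group
               "last", NA, "first",   # wrong order
               "first", NA, NA),      # "last" is missing
  stringsAsFactors = FALSE
)

ok <- tapply(dat$Comments, dat$id, is_well_behaved)
print(ok)
```

Groups flagged FALSE can then be inspected by hand before running the recoding.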