Question

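I have a large dataset that I need to recode. Each row of the dataset is a possible detection, in chronological order (time), from a separate experiment (id). Each possible detection is then manually verified. The first true detection is marked (Comments) as "first", and the last true detection is marked as "last". If there were no detections, "none" is entered.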

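I'm doing the recoding with if statements. 1) First, I want to select all cases of the variable id where both first and last are present, fill everything between first and last with 'no_comment', and fill everything before first or after last with "MVND". 2) Then select the id cases where only "none" is present and fill "none" into all rows of that id case. The individual lines of code work, but for some reason when I combine them in if statements inside ddply they don't work together - they just return the original data.frame. I think my if else structure is wrong.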

#approximate data structure for this case:
y <-data.frame(id=c(rep("a",10),rep("b",10),rep("c",10)),time=rep(1:10, 3), Comments=rep(NA,30))
 y$Comments[c(2,11,23)]<-"first"
 y$Comments[c(9,19,30)]<-"last"
 #x=y[y$id=="a",] #testing specific lines
 
#recursive process to step through the data
 ddply(y,.(id), .fUN=function(x){
 if(all(unique(na.omit(x$Comments))%in%c("first","last"))){
  f<-which(x$Comments == "first")
  l<-which(x$Comments == "last")  
  #Add no comment to all records between first and last
   x$Comments[(f+1):(l - 1)]<- "no_comment"
      #if 'first' isn't the first record add MVND to all things before 'first'    
       if(f>1){x$Comments[1:(f-1)]<-"MVND"} 
      #if 'last' isn't the last record add MVND to all records after 'last'.
       if(l<nrow[x]){x$Comments[(l+1):nrow(x)]<-"MVND"} 
 }else if(unique(na.omit(x$Comments))=="none"){
    x$Comments<-"none" #if the only unique comment is "none" set all comments to none
}
 }
 )

If data.table is a better way to do this, I'd love to figure out how to do it in dt.

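#Edit: the above has been modified to expand on both the "first/last" and "none" cases I'm dealing with. Jon Spring's solution worked very well for the sample data as I originally posted it, which only contained the first/last case.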

Answer 1

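Not sure if this is useful for you, but here's how I'd approach it in dplyr. Since this is vectorized, I'd expect it to run faster than a loop-based approach.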

library(dplyr)
y %>%
  group_by(id) %>%
  dplyr::mutate(Comments2 = case_when(     # in case `plyr` is loaded
    cumsum(coalesce(lag(Comments == "last"), FALSE)) >= 1 ~ "MVND",
    cumsum(coalesce(Comments == "first", FALSE)) < 1 ~ "MVND",
    is.na(Comments) ~ "no_comment",
    TRUE ~ Comments)) %>%
  ungroup()

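The tricky part here is the MVND bookends, where I count whether we've already passed last or haven't reached first yet. coalesce replaces any NA in its first argument with the FALSE value given as its second argument. cumsum then adds up the TRUE values.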
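To make the two flags easier to see, here is a minimal sketch on a single made-up vector (the values in `comments` and the names `after_last` and `before_first` are just for illustration, they are not part of the original data):

```r
library(dplyr)

# One experiment's comments in time order (made-up values for illustration)
comments <- c(NA, "first", NA, NA, "last", NA)

# TRUE from the row *after* "last" onwards: lag() shifts the flag down one row,
# coalesce() turns the NAs into FALSE, and cumsum() stays >= 1 once "last" is passed
after_last <- cumsum(coalesce(dplyr::lag(comments == "last"), FALSE)) >= 1

# TRUE for every row seen before "first" shows up
before_first <- cumsum(coalesce(comments == "first", FALSE)) < 1

data.frame(comments, before_first, after_last)
```

Rows where either flag is TRUE are the ones that end up as "MVND" in the case_when() above.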

Here's the result I get, pasted as a tribble using datapasta. As far as I can tell, the output looks as expected:

tibble::tribble(
  ~id, ~time, ~Comments,   ~Comments2,
  "a",    1L,        NA,       "MVND",
  "a",    2L,   "first",      "first",
  "a",    3L,        NA, "no_comment",
  "a",    4L,        NA, "no_comment",
  "a",    5L,        NA, "no_comment",
  "a",    6L,        NA, "no_comment",
  "a",    7L,        NA, "no_comment",
  "a",    8L,        NA, "no_comment",
  "a",    9L,    "last",       "last",
  "a",   10L,        NA,       "MVND",
  "b",    1L,   "first",      "first",
  "b",    2L,        NA, "no_comment",
  "b",    3L,        NA, "no_comment",
  "b",    4L,        NA, "no_comment",
  "b",    5L,        NA, "no_comment",
  "b",    6L,        NA, "no_comment",
  "b",    7L,        NA, "no_comment",
  "b",    8L,        NA, "no_comment",
  "b",    9L,    "last",       "last",
  "b",   10L,        NA,       "MVND",
  "c",    1L,        NA,       "MVND",
  "c",    2L,        NA,       "MVND",
  "c",    3L,   "first",      "first",
  "c",    4L,        NA, "no_comment",
  "c",    5L,        NA, "no_comment",
  "c",    6L,        NA, "no_comment",
  "c",    7L,        NA, "no_comment",
  "c",    8L,        NA, "no_comment",
  "c",    9L,        NA, "no_comment",
  "c",   10L,    "last",       "last"
  )
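
The code above only covers the first/last case. One possible way to also cover ids that only contain "none" is sketched below; this is an assumption about how the edited question could be handled, not something tested on the real data, and the small `y` here is a made-up stand-in:

```r
library(dplyr)

# Made-up sample: id "a" has first/last, id "d" only has "none"
y <- data.frame(
  id = rep(c("a", "d"), each = 5L),
  time = rep(1:5, 2L),
  Comments = c(NA, "first", NA, "last", NA, NA, NA, "none", NA, NA),
  stringsAsFactors = FALSE
)

recoded <- y %>%
  group_by(id) %>%
  dplyr::mutate(Comments2 = if (any(Comments == "none", na.rm = TRUE)) {
    # the group contains "none": fill every row of the group with "none"
    rep("none", n())
  } else {
    # otherwise apply the same first/last logic as before
    case_when(
      cumsum(coalesce(dplyr::lag(Comments == "last"), FALSE)) >= 1 ~ "MVND",
      cumsum(coalesce(Comments == "first", FALSE)) < 1 ~ "MVND",
      is.na(Comments) ~ "no_comment",
      TRUE ~ Comments)
  }) %>%
  ungroup()

recoded
```

The if/else is evaluated once per id group, so each group takes exactly one of the two branches.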

Answer 2

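For this task, my preferred approach is data.table, for two reasons:

Parts of a column can be updated in place, i.e., without copying
We can update in non-equi joins using a look-up table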

To cover all use cases mentioned by the OP, we need to create an enhanced sample dataset

y <- data.frame(
  id = rep(letters[1:5], each = 5L),
  time = rep(1:5, 5L),
  Comments = rep(NA_character_, 25L))
y$Comments[c(2, 6, 13, 22)] <- "first"
y$Comments[c(4, 9, 15, 23)] <- "last"
y$Comments[c(18)] <- "none"

y


   id time Comments
1   a    1     <NA>
2   a    2    first
3   a    3     <NA>
4   a    4     last
5   a    5     <NA>
6   b    1    first
7   b    2     <NA>
8   b    3     <NA>
9   b    4     last
10  b    5     <NA>
11  c    1     <NA>
12  c    2     <NA>
13  c    3    first
14  c    4     <NA>
15  c    5     last
16  d    1     <NA>
17  d    2     <NA>
18  d    3     none
19  d    4     <NA>
20  d    5     <NA>
21  e    1     <NA>
22  e    2    first
23  e    3     last
24  e    4     <NA>
25  e    5     <NA>

Now, we can fill in the missing Comments

library(data.table)
y <- setDT(copy(y))
# copy "none" to all rows of the id group in case one Comment is "none" 
y[, Comments := if (isTRUE(any(Comments == "none"))) "none" , by = id][]
# create look-up table
lut <- dcast(y[which(Comments %in% c("first", "last"))], id ~ Comments, value.var = "time")
# update in non-equi joins
y[lut, on = .(id, time < first), Comments := "MVND"][]
y[lut, on = .(id, time > last), Comments := "MVND"][]
y[lut, on = .(id, time > first, time < last), Comments := "no commments"][]


    id time     Comments
 1:  a    1         MVND
 2:  a    2        first
 3:  a    3 no commments
 4:  a    4         last
 5:  a    5         MVND
 6:  b    1        first
 7:  b    2 no commments
 8:  b    3 no commments
 9:  b    4         last
10:  b    5         MVND
11:  c    1         MVND
12:  c    2         MVND
13:  c    3        first
14:  c    4 no commments
15:  c    5         last
16:  d    1         none
17:  d    2         none
18:  d    3         none
19:  d    4         none
20:  d    5         none
21:  e    1         MVND
22:  e    2        first
23:  e    3         last
24:  e    4         MVND
25:  e    5         MVND
    id time     Comments

The look-up table lut contains the times at which first and last occur for each id:


   id first last
1:  a     2    4
2:  b     1    4
3:  c     3    5
4:  e     2    3

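Note that we assume the production dataset is "well-behaved", i.e.,

any id group contains either "none"
or exactly one pair of "first" and "last" in the Comments column
and "first" always appears before "last".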
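
If you want to check these assumptions before recoding, a minimal base R sketch could look like this. `check_well_behaved` is a hypothetical helper name, and treating a group as valid when all of its non-missing comments are "none" is my reading of the first condition:

```r
# Hypothetical helper: returns the ids whose comments violate the assumptions
check_well_behaved <- function(d) {
  ok <- vapply(split(d, d$id), function(g) {
    g <- g[order(g$time), ]                  # make sure rows are in time order
    tags <- g$Comments[!is.na(g$Comments)]   # keep only the annotated rows
    only_none <- length(tags) > 0 && all(tags == "none")
    # exactly one "first" followed by exactly one "last"
    one_pair <- identical(tags, c("first", "last"))
    only_none || one_pair
  }, logical(1))
  names(ok)[!ok]
}

# Made-up example: id "x" has "last" before "first" and should be reported
d <- data.frame(
  id = rep(c("a", "d", "x"), each = 3L),
  time = rep(1:3, 3L),
  Comments = c("first", NA, "last", NA, "none", NA, "last", NA, "first"),
  stringsAsFactors = FALSE
)
check_well_behaved(d)
```

Groups without any comment at all are reported as well, since they match neither case.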

Nested if else statements

2 Answers: