From observations in one column to sequences in several columns

Time: 2015-08-28 07:46:32

Tags: r dplyr

I have data that looks like this:

set.seed(100)    
df<- data.frame(exp = c(rep(LETTERS[1:2], each = 5), "C", "C"), 
    re = c(rep(seq(1, 5, 1), 2), 1, 2), d = runif(12, 1, 40))

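For each row within each exp group of the data.frame, I want to build a sequence of the nearest d's (the row's own d and the ones that follow it):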

library(dplyr)
df <- arrange(df, exp, re) %>% 
group_by(exp) %>% 
mutate(d1 = d, d2 = lead(d), d3 = lead(d2))

I get:

   exp re         d        d1        d2        d3
1    A  1 25.389088 25.389088  1.233483 27.916293
2    A  2  1.233483  1.233483 27.916293 30.627384
3    A  3 27.916293 27.916293 30.627384 17.219979
4    A  4 30.627384 30.627384 17.219979        NA
5    A  5 17.219979 17.219979        NA        NA
6    B  1 25.280619 25.280619  1.468439 28.398679
7    B  2  1.468439  1.468439 28.398679 27.131078
8    B  3 28.398679 28.398679 27.131078  2.971437
9    B  4 27.131078 27.131078  2.971437        NA
10   B  5  2.971437  2.971437        NA        NA
11   C  1  9.892981  9.892981 21.860425        NA
12   C  2 21.860425 21.860425        NA        NA

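I don't like the NAs. If a row contains an NA, it should instead hold the last complete sequence of d1, d2, d3. For example, rows 4 and 5 have an NA in d3, so the d1, d2, d3 values in those rows should be replaced with the values from row 3. I wrote for loops to do the replacement, but they take a long time on large datasets. Can anyone think of a way to do this in dplyr?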

The expected output is:

   exp re         d        d1        d2        d3
1    A  1 25.389088 25.389088  1.233483 27.916293
2    A  2  1.233483  1.233483 27.916293 30.627384
3    A  3 27.916293 27.916293 30.627384 17.219979
4    A  4 30.627384 27.916293 30.627384 17.219979
5    A  5 17.219979 27.916293 30.627384 17.219979
6    B  1 25.280619 25.280619  1.468439 28.398679
7    B  2  1.468439  1.468439 28.398679 27.131078
8    B  3 28.398679 28.398679 27.131078  2.971437
9    B  4 27.131078 28.398679 27.131078  2.971437
10   B  5  2.971437 28.398679 27.131078  2.971437
11   C  1  9.892981  9.892981 21.860425  0.000000
12   C  2 21.860425  9.892981 21.860425  0.000000

1 answer:

Answer 0 (score: 1)

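After the mutate step in the OP's code, we can use mutate_each to replace the NA values in columns 'd1' to 'd3'. We create an if condition that checks whether the group has more than 2 elements. If it does, we replace the elements from position 4 onward (which(row_number() > 3)) with the third element (.[3L]); else, we replicate the first element as many times as there are elements in the group (rep(.[1L], n())). For 'd3', exp 'C' will still have NA elements, which can be replaced with 0 in the next mutate.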

arrange(df, exp, re) %>% 
      group_by(exp) %>% 
      mutate(d1=d, d2=lead(d), d3=lead(d2)) %>% 
      mutate_each(funs(if(all(n()>2)) replace(., which(row_number()>3),
                .[3L]) else rep(.[1L], n())), d1:d3) %>% 
      mutate(d3= replace(d3, is.na(d3), 0))

#   exp re         d        d1        d2        d3
#1    A  1 25.389088 25.389088  1.233483 27.916293
#2    A  2  1.233483  1.233483 27.916293 30.627384
#3    A  3 27.916293 27.916293 30.627384 17.219979
#4    A  4 30.627384 27.916293 30.627384 17.219979
#5    A  5 17.219979 27.916293 30.627384 17.219979
#6    B  1 25.280619 25.280619  1.468439 28.398679
#7    B  2  1.468439  1.468439 28.398679 27.131078
#8    B  3 28.398679 28.398679 27.131078  2.971437
#9    B  4 27.131078 28.398679 27.131078  2.971437
#10   B  5  2.971437 28.398679 27.131078  2.971437
#11   C  1  9.892981  9.892981 21.860425  0.000000
#12   C  2 21.860425  9.892981 21.860425  0.000000
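The code above hard-codes position 3, which matches the last complete row only when a group has exactly 5 rows. As a hedged sketch (not part of the original answer), the same result can be obtained by computing, per group, the index of the last complete row, the same idea as the data.table variant further down. It assumes d itself contains no NAs; the helper column idx is a name introduced here for illustration. It also avoids mutate_each() and funs(), which later dplyr releases deprecated.

```r
library(dplyr)

# Same values as the example data in this post.
df <- data.frame(
  exp = c(rep(c("A", "B"), each = 5), "C", "C"),
  re  = c(1:5, 1:5, 1:2),
  d   = c(25.389088, 1.233483, 27.916293, 30.627384, 17.219979,
          25.280619, 1.468439, 28.398679, 27.131078, 2.971437,
          9.892981, 21.860425)
)

res <- df %>%
  arrange(exp, re) %>%
  group_by(exp) %>%
  mutate(d1 = d, d2 = lead(d), d3 = lead(d2)) %>%
  mutate(
    # Row index (within the group) of the last complete d1..d3 triple so far;
    # a group with no complete triple falls back to its first row.
    idx = cumsum(!is.na(d3)) + all(is.na(d3)),
    d1  = d1[idx],
    d2  = d2[idx],
    d3  = coalesce(d3[idx], 0)  # remaining NAs (group "C") become 0
  ) %>%
  ungroup() %>%
  select(-idx)

print(as.data.frame(res))
```

Because idx is derived from the data, groups of any length should be handled without changing the code.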

Alternatively, we can use shift from the devel version of data.table, i.e. v1.9.5. Instructions for installing the devel version are here.

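We convert the 'data.frame' to a 'data.table' (setDT(df)) and order it by the 'exp' and 're' columns. Grouped by 'exp', we shift 'd' with n = 0:2 and type = 'lead' to get 3 new columns ('tmp'). We create a logical index ('i1') from the last column of 'tmp' (is.na(tmp[[3]])). We then create a numeric index ('i2') by taking the cumulative sum of the non-NA elements (!i1) and adding (+) TRUE for groups that have only NAs in the 'd3' column (all(i1)). We loop over the columns of 'tmp' with lapply, using 'i2' as the index to extract rows. Finally, we change the NA values in 'd3' to 0.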

library(data.table)#v1.9.5+
setDT(df)[order(exp,re), paste0('d', 1:3) := {
                  tmp <- shift(d, 0:2, type='lead')
                  i1 <- is.na(tmp[[3]])
                  i2 <- cumsum(!i1) + all(i1) 
                  lapply(tmp, function(x) x[i2])
                  }, by = exp]
df[is.na(d3), d3:=0]
df
#   exp re         d        d1        d2        d3
# 1:   A  1 25.389088 25.389088  1.233483 27.916293
# 2:   A  2  1.233483  1.233483 27.916293 30.627384
# 3:   A  3 27.916293 27.916293 30.627384 17.219979
# 4:   A  4 30.627384 27.916293 30.627384 17.219979
# 5:   A  5 17.219979 27.916293 30.627384 17.219979
# 6:   B  1 25.280619 25.280619  1.468439 28.398679
# 7:   B  2  1.468439  1.468439 28.398679 27.131078
# 8:   B  3 28.398679 28.398679 27.131078  2.971437
# 9:   B  4 27.131078 28.398679 27.131078  2.971437
#10:   B  5  2.971437 28.398679 27.131078  2.971437
#11:   C  1  9.892981  9.892981 21.860425  0.000000
#12:   C  2 21.860425  9.892981 21.860425  0.000000
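To make the role of 'i1' and 'i2' easier to follow, here is a small base R sketch of the index logic on single groups. It is an illustration only: the lapply() over 0:2 is a stand-in for shift(d, 0:2, type = 'lead'), relying on the fact that indexing past the end of a vector yields NA, and the names d_c, tmp_c, i1_c and i2_c are introduced here.

```r
# Values of d for group A, already ordered by re.
d <- c(25.389088, 1.233483, 27.916293, 30.627384, 17.219979)

# Base R stand-in for shift(d, 0:2, type = "lead"):
# out-of-range positions come back as NA.
tmp <- lapply(0:2, function(k) d[seq_along(d) + k])

i1 <- is.na(tmp[[3]])        # FALSE FALSE FALSE  TRUE  TRUE
i2 <- cumsum(!i1) + all(i1)  # 1 2 3 3 3
res <- lapply(tmp, function(x) x[i2])

# A group with fewer than 3 rows (like exp "C") has no complete row,
# so cumsum(!i1_c) is all 0 and all(i1_c) bumps every index to 1.
d_c   <- c(9.892981, 21.860425)
tmp_c <- lapply(0:2, function(k) d_c[seq_along(d_c) + k])
i1_c  <- is.na(tmp_c[[3]])
i2_c  <- cumsum(!i1_c) + all(i1_c)  # 1 1
```

Rows 4 and 5 of group A both map to index 3, the last row whose third lead is not NA, which is why they receive row 3's values.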

Data

df <- structure(list(exp = c("A", "A", "A", "A", "A", "B", "B", "B", 
"B", "B", "C", "C"), re = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 
5L, 1L, 2L), d = c(25.389088, 1.233483, 27.916293, 30.627384, 
17.219979, 25.280619, 1.468439, 28.398679, 27.131078, 2.971437, 
9.892981, 21.860425)), .Names = c("exp", "re", "d"), row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"),
class = "data.frame")