I have data that looks like this:
set.seed(100)
df<- data.frame(exp = c(rep(LETTERS[1:2], each = 5), "C", "C"),
re = c(rep(seq(1, 5, 1), 2), 1, 2), d = runif(12, 1, 40))
For each row within each exp group of the data.frame, I want to build a sequence of the closest d values (the current d and the next two):
library(dplyr)
df <- arrange(df, exp, re) %>%
group_by(exp) %>%
mutate(d1 = d, d2 = lead(d), d3 = lead(d2))
This gives me:
exp re d d1 d2 d3
1 A 1 25.389088 25.389088 1.233483 27.916293
2 A 2 1.233483 1.233483 27.916293 30.627384
3 A 3 27.916293 27.916293 30.627384 17.219979
4 A 4 30.627384 30.627384 17.219979 NA
5 A 5 17.219979 17.219979 NA NA
6 B 1 25.280619 25.280619 1.468439 28.398679
7 B 2 1.468439 1.468439 28.398679 27.131078
8 B 3 28.398679 28.398679 27.131078 2.971437
9 B 4 27.131078 27.131078 2.971437 NA
10 B 5 2.971437 2.971437 NA NA
11 C 1 9.892981 9.892981 21.860425 NA
12 C 2 21.860425 21.860425 NA NA
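I don't like the NAs. If a row contains an NA, it should instead look like the last complete d1, d2, d3 sequence in its group. For example, rows 4 and 5 have NA in d3, so the d1, d2, d3 values in those rows should be replaced with the values from row 3.
I wrote for loops to do the replacement, but they take a long time on large datasets. Can anyone think of a way to do this in dplyr?
The expected output is: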
exp re d d1 d2 d3
1 A 1 25.389088 25.389088 1.233483 27.916293
2 A 2 1.233483 1.233483 27.916293 30.627384
3 A 3 27.916293 27.916293 30.627384 17.219979
4 A 4 30.627384 27.916293 30.627384 17.219979
5 A 5 17.219979 27.916293 30.627384 17.219979
6 B 1 25.280619 25.280619 1.468439 28.398679
7 B 2 1.468439 1.468439 28.398679 27.131078
8 B 3 28.398679 28.398679 27.131078 2.971437
9 B 4 27.131078 28.398679 27.131078 2.971437
10 B 5 2.971437 28.398679 27.131078 2.971437
11 C 1 9.892981 9.892981 21.860425 0
12 C 2 21.860425 9.892981 21.860425 0
Answer 0 (score: 1)
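After the mutate step in the OP's code, we can use mutate_each to replace the NA values in columns 'd1' to 'd3'. We build an if condition on whether the number of elements in the group is greater than 2. If it is, we replace the elements from position 4 onward (which(row_number() > 3)) with the third element (.[3L]); else, we replicate the first element as many times as there are elements in the group (rep(.[1L], n())). For 'd3', exp 'C' will still have NA elements, which can be replaced with 0 in the next mutate.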
arrange(df, exp, re) %>%
group_by(exp) %>%
mutate(d1=d, d2=lead(d), d3=lead(d2)) %>%
mutate_each(funs(if(all(n()>2)) replace(., which(row_number()>3),
.[3L]) else rep(.[1L], n())), d1:d3) %>%
mutate(d3= replace(d3, is.na(d3), 0))
# exp re d d1 d2 d3
#1 A 1 25.389088 25.389088 1.233483 27.916293
#2 A 2 1.233483 1.233483 27.916293 30.627384
#3 A 3 27.916293 27.916293 30.627384 17.219979
#4 A 4 30.627384 27.916293 30.627384 17.219979
#5 A 5 17.219979 27.916293 30.627384 17.219979
#6 B 1 25.280619 25.280619 1.468439 28.398679
#7 B 2 1.468439 1.468439 28.398679 27.131078
#8 B 3 28.398679 28.398679 27.131078 2.971437
#9 B 4 27.131078 28.398679 27.131078 2.971437
#10 B 5 2.971437 28.398679 27.131078 2.971437
#11 C 1 9.892981 9.892981 21.860425 0.000000
#12 C 2 21.860425 9.892981 21.860425 0.000000
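mutate_each() and funs() have since been deprecated in dplyr, so the pipeline above may warn or fail on recent versions. The following is a minimal sketch of the same logic using across(), assuming dplyr 1.0.0 or later. Inside the helper, length(x) and seq_along(x) stand in for n() and row_number(); the name fill_from_third is hypothetical and not part of the original answer.

```r
library(dplyr)

# Same values as the example data in the question
df <- data.frame(
  exp = c(rep(c("A", "B"), each = 5), "C", "C"),
  re  = c(1:5, 1:5, 1:2),
  d   = c(25.389088, 1.233483, 27.916293, 30.627384, 17.219979,
          25.280619, 1.468439, 28.398679, 27.131078, 2.971437,
          9.892981, 21.860425)
)

# Hypothetical helper: in groups with more than 2 rows, overwrite
# positions 4 and later with the 3rd element; otherwise repeat the
# 1st element across the whole group.
fill_from_third <- function(x) {
  if (length(x) > 2) {
    replace(x, which(seq_along(x) > 3), x[3L])
  } else {
    rep(x[1L], length(x))
  }
}

res <- df %>%
  arrange(exp, re) %>%
  group_by(exp) %>%
  mutate(d1 = d, d2 = lead(d), d3 = lead(d2)) %>%
  mutate(across(d1:d3, fill_from_third)) %>%
  mutate(d3 = replace(d3, is.na(d3), 0)) %>%
  ungroup()

print(as.data.frame(res))
```

Like the original, this sketch hard-codes position 3 as the last complete row, which matches the five-row groups in this example; groups of other sizes would need a different position.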
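Or we can use shift from the devel version of data.table, i.e. v1.9.5. Instructions for installing the devel version are here.
We convert the 'data.frame' to a 'data.table' (setDT(df)) and order by the 'exp' and 're' columns. Grouped by 'exp', we shift 'd' with n=0:2 and type='lead' to get 3 new columns ('tmp'). We create a logical index ('i1') based on the last column of 'tmp' (is.na(tmp[[3]])). We then create a numeric index ('i2') by taking the cumulative sum of the non-NA elements (!i1) and adding (+) the TRUE value for groups that have only NA in the 'd3' column (all(i1)). We loop over the columns of 'tmp' with lapply, using 'i2' as the index to extract rows. Finally, we change the NA values in 'd3' to 0.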
library(data.table)#v1.9.5+
setDT(df)[order(exp,re), paste0('d', 1:3) := {
tmp <- shift(d, 0:2, type='lead')
i1 <- is.na(tmp[[3]])
i2 <- cumsum(!i1) + all(i1)
lapply(tmp, function(x) x[i2])
}, by = exp]
df[is.na(d3), d3:=0]
df
# exp re d d1 d2 d3
# 1: A 1 25.389088 25.389088 1.233483 27.916293
# 2: A 2 1.233483 1.233483 27.916293 30.627384
# 3: A 3 27.916293 27.916293 30.627384 17.219979
# 4: A 4 30.627384 27.916293 30.627384 17.219979
# 5: A 5 17.219979 27.916293 30.627384 17.219979
# 6: B 1 25.280619 25.280619 1.468439 28.398679
# 7: B 2 1.468439 1.468439 28.398679 27.131078
# 8: B 3 28.398679 28.398679 27.131078 2.971437
# 9: B 4 27.131078 28.398679 27.131078 2.971437
#10: B 5 2.971437 28.398679 27.131078 2.971437
#11: C 1 9.892981 9.892981 21.860425 0.000000
#12: C 2 21.860425 9.892981 21.860425 0.000000
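To see why the 'i2' index picks the last complete row, it may help to run the same steps on a single group outside of the grouped := call. This is only an illustrative sketch; the helper name last_complete_rows is hypothetical, and it assumes a data.table version that provides shift().

```r
library(data.table)

# Hypothetical helper wrapping the steps from the answer for one group
last_complete_rows <- function(d) {
  tmp <- shift(d, 0:2, type = "lead")  # list of 3 vectors: d, lead by 1, lead by 2
  i1  <- is.na(tmp[[3]])               # TRUE where the 3-value window is incomplete
  i2  <- cumsum(!i1) + all(i1)         # row to copy from; +1 when no row is complete
  list(index = i2, values = lapply(tmp, function(x) x[i2]))
}

# Group A: rows 1 to 3 are complete, rows 4 and 5 reuse row 3
a <- last_complete_rows(c(25.389088, 1.233483, 27.916293, 30.627384, 17.219979))
a$index
# 1 2 3 3 3

# Group C: no row is complete, so all(i1) shifts the index from 0 to 1
cc <- last_complete_rows(c(9.892981, 21.860425))
cc$index
# 1 1
```

In group C the third column stays NA after indexing, which is why the answer finishes with a separate step that sets the remaining NA values in 'd3' to 0.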
Data:
df <- structure(list(exp = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "C", "C"), re = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L,
5L, 1L, 2L), d = c(25.389088, 1.233483, 27.916293, 30.627384,
17.219979, 25.280619, 1.468439, 28.398679, 27.131078, 2.971437,
9.892981, 21.860425)), .Names = c("exp", "re", "d"), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"),
class = "data.frame")