Say I have the following sample dataset:
library(data.table)
iris <- data.table(iris)[c(1:5,51:55,101:105), list(ID=.I, Species,Sepal.Length)]
Then say I want to compute the absolute difference between consecutive rows within each group (in this case, Species).
iris[ , SL.Diff := c(NA,abs(diff(Sepal.Length))) , by = Species]
At this point, I have a dataset that looks like this:
   ID    Species Sepal.Length SL.Diff
1:  1     setosa          5.1      NA
2:  2     setosa          4.9     0.2
3:  3     setosa          4.7     0.2
4:  4     setosa          4.6     0.1
5:  5     setosa          5.0     0.4
6:  6 versicolor          7.0      NA
Now I want to compute a new variable, Sepal.Length2, which takes the next row's value if SL.Diff is below a threshold of 0.3.
iris[ , Sepal.Length2 := ifelse(SL.Diff < 0.3, iris[ID+1]$Sepal.Length, Sepal.Length)]
This works the way I want. But what if I want to make the same comparison and, instead of taking the next row, take the previous row's value?
iris[ , Sepal.Length3 := ifelse(SL.Diff < 0.3, iris[ID-1]$Sepal.Length, Sepal.Length)]
Sepal.Length3 does not give the output I expect. Does anyone know what I am doing wrong here?
    ID    Species Sepal.Length SL.Diff Sepal.Length2 Sepal.Length3
 1:  1     setosa          5.1      NA            NA            NA
 2:  2     setosa          4.9     0.2           4.7           4.9
 3:  3     setosa          4.7     0.2           4.6           4.7
 4:  4     setosa          4.6     0.1           5.0           4.6
 5:  5     setosa          5.0     0.4           5.0           5.0
 6:  6 versicolor          7.0      NA            NA            NA
 7:  7 versicolor          6.4     0.6           6.4           6.4
 8:  8 versicolor          6.9     0.5           6.9           6.9
 9:  9 versicolor          5.5     1.4           5.5           5.5
10: 10 versicolor          6.5     1.0           6.5           6.5
11: 11  virginica          6.3      NA            NA            NA
12: 12  virginica          5.8     0.5           5.8           5.8
13: 13  virginica          7.1     1.3           7.1           7.1
14: 14  virginica          6.3     0.8           6.3           6.3
15: 15  virginica          6.5     0.2            NA           5.1
Answer 0: (score: 5)
Not sure about the speed implications of this, but here is another attempt:
# make a column of the next values using head()
iris[, S3 := c(NA,head(Sepal.Length,-1)), by=Species]
# overwrite those values not meeting your criteria with the original values
iris[ !(SL.Diff < 0.3), S3 := Sepal.Length]
iris
#    ID    Species Sepal.Length SL.Diff  S3
# 1:  1     setosa          5.1      NA  NA
# 2:  2     setosa          4.9     0.2 5.1
# 3:  3     setosa          4.7     0.2 4.9
# 4:  4     setosa          4.6     0.1 4.7
# 5:  5     setosa          5.0     0.4 5.0
# 6:  6 versicolor          7.0      NA  NA
# 7:  7 versicolor          6.4     0.6 6.4
# 8:  8 versicolor          6.9     0.5 6.9
# 9:  9 versicolor          5.5     1.4 5.5
#10: 10 versicolor          6.5     1.0 6.5
#11: 11  virginica          6.3      NA  NA
#12: 12  virginica          5.8     0.5 5.8
#13: 13  virginica          7.1     1.3 7.1
#14: 14  virginica          6.3     0.8 6.3
#15: 15  virginica          6.5     0.2 6.3
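For what it's worth, the same "shift, then pick" idea can probably be written with data.table's shift(), which is available in more recent data.table releases. This is only a sketch and not part of the original answer: the table name dt and the helper columns prev and nxt are made up for illustration, and both directions are shifted within each Species group (unlike the question's iris[ID+1], which can reach across a group boundary).

```r
library(data.table)

# Rebuild the 15-row sample under a different name so the built-in iris is untouched
dt <- data.table(datasets::iris)[c(1:5, 51:55, 101:105), .(Species, Sepal.Length)]
dt[, ID := .I]
dt[, SL.Diff := c(NA, abs(diff(Sepal.Length))), by = Species]

# Previous and next values within each group; the group edges become NA
dt[, prev := shift(Sepal.Length, type = "lag"),  by = Species]
dt[, nxt  := shift(Sepal.Length, type = "lead"), by = Species]

# Take the shifted value only where the difference is below the threshold;
# rows where SL.Diff is NA stay NA, as in the outputs shown in this thread
dt[, S2 := ifelse(SL.Diff < 0.3, nxt,  Sepal.Length)]
dt[, S3 := ifelse(SL.Diff < 0.3, prev, Sepal.Length)]
```

Because the shift is done by group, no index arithmetic on ID is needed, so there is no zero or out-of-range index to worry about.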
Answer 1: (score: 4)
The [.data.table method evaluates i and j within the scope of the data.table in question.
Therefore iris[ID+1]$Sepal.Length evaluates ID (a second time) within the scope of iris.
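A small sketch of that scoping rule, using a made-up three-row table (dt, ID and x are illustrative names, not the question's data): an expression in i that mentions a column name is looked up inside the table itself, so the two subsets below should be the same.

```r
library(data.table)

# Hypothetical toy table, not the iris sample from the question
dt <- data.table(ID = 1:3, x = c(10, 20, 30))

# ID is resolved as a column of dt, so ID + 1 is the row index vector 2, 3, 4
by_scope <- dt[ID + 1]$x

# Spelling out the column explicitly gives the same subset
by_dollar <- dt[dt$ID + 1]$x
```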
Your problem really arises because you are creating a 0 index (which R silently drops):
a <- c('a','b')
a[0:1]
# [1] "a"
a[1]
# [1] "a"
So you need to handle the "known NA values" and the implied NA values more carefully.
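To make the consequence concrete, here is a base R sketch on the five setosa values from the question (the names x, id, test, bad and good are illustrative). Dropping the zero index leaves a vector that is one element short and not shifted at all, and ifelse() then recycles it to the length of the test. That seems consistent with the Sepal.Length3 column in the question, where each row kept its own value and the last row picked up 5.1 from row 1.

```r
x  <- c(5.1, 4.9, 4.7, 4.6, 5.0)
id <- seq_along(x)

# id - 1 is 0, 1, 2, 3, 4; the 0 is dropped, so this is just x[1:4]
prev_bad <- x[id - 1]

# Stand-in for SL.Diff < 0.3: NA for the first row, TRUE elsewhere
test <- c(NA, TRUE, TRUE, TRUE, TRUE)

# ifelse() recycles prev_bad to length 5, so the last element wraps around to x[1]
bad <- ifelse(test, prev_bad, x)

# Padding with an explicit NA keeps the lengths aligned and really shifts the values
prev_ok <- c(NA, head(x, -1))
good    <- ifelse(test, prev_ok, x)
```

Here bad comes out as NA 4.9 4.7 4.6 5.1, while good is NA 5.1 4.9 4.7 4.6.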
Here is one approach:
# calculate the "threshold" column
iris[,thresh := SL.Diff <0.3]
# where does it need to go "up" and what indexed value need it go up by
iris[!is.na(thresh), up := ifelse(thresh, ID+1L,ID)]
# create the column
iris[, S2 := Sepal.Length[up]]
# the same for "down"
iris[!is.na(thresh), down := ifelse(thresh, ID-1L,ID)]
iris[, S3 := Sepal.Length[down]]
iris
#     ID    Species Sepal.Length SL.Diff thresh up  S2 down  S3
#  1:  1     setosa          5.1      NA     NA NA  NA   NA  NA
#  2:  2     setosa          4.9     0.2   TRUE  3 4.7    1 5.1
#  3:  3     setosa          4.7     0.2   TRUE  4 4.6    2 4.9
#  4:  4     setosa          4.6     0.1   TRUE  5 5.0    3 4.7
#  5:  5     setosa          5.0     0.4  FALSE  5 5.0    5 5.0
#  6:  6 versicolor          7.0      NA     NA NA  NA   NA  NA
#  7:  7 versicolor          6.4     0.6  FALSE  7 6.4    7 6.4
#  8:  8 versicolor          6.9     0.5  FALSE  8 6.9    8 6.9
#  9:  9 versicolor          5.5     1.4  FALSE  9 5.5    9 5.5
# 10: 10 versicolor          6.5     1.0  FALSE 10 6.5   10 6.5
# 11: 11  virginica          6.3      NA     NA NA  NA   NA  NA
# 12: 12  virginica          5.8     0.5  FALSE 12 5.8   12 5.8
# 13: 13  virginica          7.1     1.3  FALSE 13 7.1   13 7.1
# 14: 14  virginica          6.3     0.8  FALSE 14 6.3   14 6.3
# 15: 15  virginica          6.5     0.2   TRUE 16  NA   14 6.3
Answer 2: (score: 1)
I think dplyr makes this easier to express, since it provides the lead() and lag() functions:
library(dplyr)
iris2 <- iris[c(1:5, 51:55, 101:105), c("Species", "Sepal.Length")]
names(iris2) <- c("species", "sepal")
iris2$id <- 1:15
iris2 %>%
  group_by(species) %>%
  mutate(
    thres = abs(sepal - lag(sepal)),
    up = ifelse(thres < 0.3, lead(sepal), sepal),
    down = ifelse(thres < 0.3, lag(sepal), sepal)
  )