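I have a data.table with two columns (date and status), and I want to add new columns derived from the original table.
Data rules:
New variables:
For example, a simple input:
Code to generate the data: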
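The output, including the new columns, is:
I have already solved this with some basic methods,
but I think there may be a simpler solution for this case.
Answer 0 (score: 2)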
So a transition from 1 to 0 marks a group boundary. You can use cumsum and diff to achieve this. Using the x example from @zx8754's answer:
data.frame(x, group_id = c(1, cumsum(diff(x) == -1) + 1))
  x group_id
1 0        1
2 0        1
3 0        1
4 1        1
5 1        1
6 0        2
7 0        2
8 1        2
9 0        3
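To see why this works, the one-liner can be unpacked into its intermediate steps. This is only an illustrative breakdown: the names d, boundary and group_id are chosen here for readability and are not part of the original answer.

```r
x <- c(0, 0, 0, 1, 1, 0, 0, 1, 0)

# diff() gives x[i + 1] - x[i]: -1 marks a 1 -> 0 transition,
# +1 a 0 -> 1 transition, and 0 means no change
d <- diff(x)

# TRUE wherever the next element starts a new group (a 1 -> 0 drop)
boundary <- d == -1

# cumsum() counts the boundaries seen so far; the leading 1 covers the
# first element, for which diff() has no entry
group_id <- c(1, cumsum(boundary) + 1)
```

Because diff() returns a vector one element shorter than its input, the explicit leading 1 is what keeps group_id the same length as x.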
For a more realistically sized example:
res = data.frame(status = sample(c(0,1), 10e7, replace = TRUE))
system.time(res$group_id <- c(1, cumsum(diff(res$status) == -1) + 1))
user system elapsed
2.770 1.680 4.449
> head(res, 20)
   status group_id
1       0        1
2       0        1
3       1        1
4       0        2
5       0        2
6       0        2
7       1        2
8       1        2
9       0        3
10      1        3
11      1        3
12      0        4
13      1        4
14      0        5
15      0        5
16      1        5
17      0        6
18      0        6
19      1        6
20      0        7
About 4.5 seconds for 10e7 (that is, 100 million) records is pretty fast (although that depends on your definition of fast :).
Benchmark
set.seed(1)
res = data.frame(status = sample(c(0,1), 10e4, replace = TRUE))
microbenchmark::microbenchmark(
rleid = {
gr <- data.table::rleid(res$status)
x1 <- as.numeric(as.factor(ifelse(gr %% 2 == 0, gr - 1, gr)))
# removing "as.numeric(as.factor" helps, but still not as fast as cumsum
#x1 <- ifelse(gr %% 2 == 0, gr - 1, gr)
},
cumsum = { x2 <- c(1, cumsum(diff(res$status) == -1) + 1) }
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# rleid 118.161287 120.149619 122.673747 121.736122 123.271881 168.88777 100 b
# cumsum 1.511811 1.559563 2.221273 1.826404 2.475402 6.88169 100 a
identical(x1, x2)
# [1] TRUE
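Since the question is about a data.table, the same expression can be used to add the column by reference. The sketch below is an assumption-laden illustration: the question's actual data and rules are not shown, so the column names (date, status), the sample values, and the derived group_start column are all hypothetical.

```r
library(data.table)

# Hypothetical input: column names and values are assumptions,
# not the asker's real data
dt <- data.table(
  date   = as.Date("2017-01-01") + 0:8,
  status = c(0, 0, 0, 1, 1, 0, 0, 1, 0)
)

# Add the group id by reference; a new group starts at each 1 -> 0 transition
dt[, group_id := c(1, cumsum(diff(status) == -1) + 1)]

# Per-group columns can then be derived with `by`,
# for example the first date in each group (hypothetical example column)
dt[, group_start := min(date), by = group_id]
```

Once group_id exists, any further per-group variables can be computed in the same way with by = group_id, without looping over rows.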
Answer 1 (score: 2)
Try this:
#dummy data
x <- c(0,0,0,1,1,0,0,1,0)
#get group id using rleid from data.table
gr <- data.table::rleid(x)
#merge separated 0,1 groups
gr <- ifelse(gr %% 2 == 0, gr - 1, gr)
#result
cbind(x, gr)
# x gr
# [1,] 0 1
# [2,] 0 1
# [3,] 0 1
# [4,] 1 1
# [5,] 1 1
# [6,] 0 3
# [7,] 0 3
# [8,] 1 3
# [9,] 0 5
#if we need to have group names sequential then
cbind(x, gr = as.numeric(as.factor(gr)))
# x gr
# [1,] 0 1
# [2,] 0 1
# [3,] 0 1
# [4,] 1 1
# [5,] 1 1
# [6,] 0 2
# [7,] 0 2
# [8,] 1 2
# [9,] 0 3
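As a possible shortcut, the ifelse and as.factor steps can be replaced by a single integer division on the run ids. This is a sketch rather than part of the original answer, and the names gr_seq, y, y_rleid and y_cumsum are illustrative. It also shows a caveat: pairing runs like this assumes the series starts with 0, so it can disagree with the cumsum/diff approach when the series starts with 1.

```r
x <- c(0, 0, 0, 1, 1, 0, 0, 1, 0)
gr <- data.table::rleid(x)

# Runs 1 and 2 map to group 1, runs 3 and 4 to group 2, and so on
gr_seq <- (gr + 1L) %/% 2L

# Caveat: if the series starts with 1, the leading run of 1s is paired
# with the run of 0s that follows it
y <- c(1, 1, 0, 0, 1, 0)
y_rleid  <- (data.table::rleid(y) + 1L) %/% 2L   # 1 1 1 1 2 2
y_cumsum <- c(1, cumsum(diff(y) == -1) + 1)      # 1 1 2 2 2 3
```

If a transition from 1 to 0 is meant to start a new group, the cumsum/diff result is the one that follows that rule when the data begins with 1.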