I have a data frame with the following layout:
id |diff
----
1 | 0
1 | 3
1 | 45
1 | 9
1 | 40
1 | 34
1 | 43
1 | 7
2 | 0
2 | 5
3 | 0
3 | 45
3 | 40
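I need to add a counter that restarts at 1 for each new id and increases by 1 whenever diff is greater than 10.
The output I am looking for is: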
id |diff | counter
-------------
1 | 0 | 1
1 | 3 | 1
1 | 45 | 2
1 | 9 | 2
1 | 40 | 3
1 | 34 | 4
1 | 43 | 5
1 | 7 | 5
2 | 0 | 1
2 | 5 | 1
3 | 0 | 1
3 | 45 | 2
3 | 40 | 3
The for loop solution is:
raw_data$counter <- 1  # the first row always starts at 1
for (i in 2:nrow(raw_data)) {
  raw_data$counter[i] <- ifelse(raw_data$id[i] == raw_data$id[i - 1],
    ifelse(raw_data$diff[i] > 10, raw_data$counter[i - 1] + 1, raw_data$counter[i - 1]),
    1)
}
I know the 'for' loop makes this slow. I am looking for a faster way.
Answer 0: (score: 4)
Here is how to do this with dplyr:
df1 <- read.table(text="id diff
1 0
1 3
1 45
1 9
1 40
1 34
1 43
1 7
2 0
2 5
3 0
3 45
3 40",header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1%>%
group_by(id)%>%
mutate(counter=cumsum(diff>10)+1)
id diff counter
<int> <int> <dbl>
1 1 0 1
2 1 3 1
3 1 45 2
4 1 9 2
5 1 40 3
6 1 34 4
7 1 43 5
8 1 7 5
9 2 0 1
10 2 5 1
11 3 0 1
12 3 45 2
13 3 40 3
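The idea behind cumsum(diff > 10) + 1 is that diff > 10 flags every row that should bump the counter, and the running sum of those flags within each id gives the counter. As a minimal sketch of the same idea without any packages, assuming base R's ave() is acceptable (this is not part of the answer above, just an illustration):

```r
# Base R sketch (no packages); df1 mirrors the sample data from the question
df1 <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3),
  diff = c(0, 3, 45, 9, 40, 34, 43, 7, 0, 5, 0, 45, 40)
)

# diff > 10 marks the rows that should bump the counter;
# ave() applies cumsum() within each id, and + 1L makes each group start at 1
df1$counter <- ave(as.integer(df1$diff > 10), df1$id, FUN = cumsum) + 1L

# Compare with the desired output listed in the question
expected <- c(1, 1, 2, 2, 3, 4, 5, 5, 1, 1, 1, 2, 3)
stopifnot(all(df1$counter == expected))
```

Note that the cumsum approach matches the loop only if the first diff of each id is not greater than 10, as in the sample data where each id starts at 0.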
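Answer 1: (score: 1)
Since the OP is looking for a faster way, here is a benchmark comparison of P Lapointe's dplyr solution and a data.table version.
The data.table version is a rewrite of P Lapointe's approach in data.table syntax: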
library(data.table) # CRAN version 1.10.4 used
DT <- fread(
"id |diff
1 | 0
1 | 3
1 | 45
1 | 9
1 | 40
1 | 34
1 | 43
1 | 7
2 | 0
2 | 5
3 | 0
3 | 45
3 | 40"
, sep = "|")
DT[, counter := cumsum(diff > 10L) + 1L, id]
DT
# id diff counter
# 1: 1 0 1
# 2: 1 3 1
# 3: 1 45 2
# 4: 1 9 2
# 5: 1 40 3
# 6: 1 34 4
# 7: 1 43 5
# 8: 1 7 5
# 9: 2 0 1
#10: 2 5 1
#11: 3 0 1
#12: 3 45 2
#13: 3 40 3
For benchmarking, a larger data set of 130,000 rows is created:
# copy original data set 10000 times
DTlarge <- rbindlist(lapply(seq_len(10000L), function(x) DT))
# make id column unique again
DTlarge[, id := rleid(id)]
dim(DTlarge)
#[1] 130000 2
Timing is done with the microbenchmark package:
df1 <- as.data.frame(DTlarge)
dt1 <- copy(DTlarge)
library(dplyr)
microbenchmark::microbenchmark(
dplyr = {
df1%>%
group_by(id)%>%
mutate(counter=cumsum(diff>10)+1)
},
dt = {
dt1[, counter := cumsum(diff > 10L) + 1L, id]
},
times = 10L
)
The results show that the data.table version is about 20 times faster for this problem size:
Unit: milliseconds
expr min lq mean median uq max neval
dplyr 500.51729 505.50173 512.25642 509.64096 517.31095 535.2736 10
dt 23.06037 23.99073 25.30913 24.71059 25.98322 30.7868 10