# Say I have a situation like this:
user_id = c(1:5,1:5)
time = c(1:10)
visit_log = data.frame(user_id, time)
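# And I've written a function to calculate the interval
interval <- function(data) {
  # The first visit has no predecessor, so its interval is Inf
  interval = c(Inf)
  # Start at 2 so that data$time[i - 1] always refers to an existing row
  for (i in seq_along(data$time)[-1]) {
    intv = data$time[i] - data$time[i - 1]
    interval = append(interval, intv)
  }
  data$interval = interval
  return(data)
}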
# But when I want to get the intervals per user_id and bind them to the data.frame,
# I can't find a proper way.
# Is there any method to get something like
new_data = merge(by(visit_log, INDICE=visit_log$user_id, FUN=interval))
# And the result should be
   user_id time interval
1        1    1      Inf
2        2    2      Inf
3        3    3      Inf
4        4    4      Inf
5        5    5      Inf
6        1    6        5
7        2    7        5
8        3    8        5
9        4    9        5
10       5   10        5
Answer 0 (score: 3)
We can replace your loop with the diff() function, which computes the differences between adjacent elements of a vector, for example:
> diff(c(1,3,6,10))
[1] 2 3 4
We can prepend Inf to the differences with c(Inf, diff(x)).
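As a quick check of this idiom, the snippet below prepends Inf to the differences of the same vector used in the diff() example. The variable names x and intervals are only illustrative.

```r
# Illustrative vector; the same values as in the diff() example
x <- c(1, 3, 6, 10)

# Prepend Inf so the result has the same length as x;
# the first element has no predecessor, hence no finite interval
intervals <- c(Inf, diff(x))
print(intervals)
```

The result has four elements: Inf followed by 2, 3 and 4, so it lines up one-to-one with the input.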
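What we need next is to apply the above separately to each user_id. There are many options for this, but here I use aggregate(). Confusingly, this function returns a data frame whose time component is itself a matrix. We need to convert that matrix to a vector, relying on the fact that R fills matrices column by column. Finally, we add the interval column to the input data, as in your original version of the function.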
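interval <- function(x) {
  # Per-user intervals; the time component of the result is a matrix
  diffs <- aggregate(time ~ user_id, data = x, function(y) c(Inf, diff(y)))
  # Flatten the matrix column by column into a plain numeric vector
  diffs <- as.numeric(diffs$time)
  # Add the interval column to the input data
  x <- within(x, interval <- diffs)
  x
}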
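One caveat: flattening the matrix column by column reproduces the original row order only when the rows cycle through the users in the same order on every pass and every user has the same number of visits, as in the example data. A minimal sketch of an alternative that does not depend on row order uses base R's ave(), which applies a function within groups and returns the results in their original positions. The names interval_ave and visit_log2 are hypothetical, and the sketch assumes that time is already increasing within each user.

```r
# Hypothetical alternative to interval(); assumes the columns are named
# user_id and time, and that time is increasing within each user
interval_ave <- function(x) {
  # ave() splits time by user_id, applies FUN to each group, and puts
  # the results back in the original row positions
  x$interval <- ave(as.numeric(x$time), x$user_id,
                    FUN = function(y) c(Inf, diff(y)))
  x
}

# Made-up data with rows grouped by user rather than interleaved
visit_log2 <- data.frame(user_id = rep(1:2, each = 3),
                         time = c(1, 4, 9, 2, 3, 7))
result2 <- interval_ave(visit_log2)
print(result2)
```

With this made-up data, each user's first row gets Inf and the remaining rows get the gap to that user's previous visit.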
Here is a slightly extended example, with 3 time points per user, to illustrate the function above:
> visit_log = data.frame(user_id = rep(1:5, 3), time = 1:15)
> interval(visit_log)
   user_id time interval
1        1    1      Inf
2        2    2      Inf
3        3    3      Inf
4        4    4      Inf
5        5    5      Inf
6        1    6        5
7        2    7        5
8        3    8        5
9        4    9        5
10       5   10        5
11       1   11        5
12       2   12        5
13       3   13        5
14       4   14        5
15       5   15        5
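To see why the conversion with as.numeric() is needed, it may help to inspect the intermediate result of aggregate() on the same data. The sketch below is only an illustration; the name agg is arbitrary.

```r
# Same extended example data as above
visit_log <- data.frame(user_id = rep(1:5, 3), time = 1:15)

# One row per user; the time component holds one column per visit
agg <- aggregate(time ~ user_id, data = visit_log,
                 FUN = function(y) c(Inf, diff(y)))

# time is a matrix with 5 rows (users) and 3 columns (visits)
print(is.matrix(agg$time))
print(dim(agg$time))

# as.numeric() reads the matrix column by column: first every user's
# first interval, then every user's second interval, and so on
flat <- as.numeric(agg$time)
print(flat)
```

Because the example rows cycle through users 1 to 5 in the same order on every pass, this column-by-column order happens to match the row order of visit_log.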