I have an array of data containing some information about people and projects:
person_id | project_id | action | time
--------------------------------------
1 | 1 | w | 1
1 | 2 | w | 2
1 | 3 | w | 2
1 | 3 | r | 3
1 | 3 | w | 4
1 | 4 | w | 4
2 | 2 | r | 2
2 | 2 | w | 3
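I would like to augment this data with two fields named "first_time" and "first_time_project". The first identifies the first time any action by that person was seen, and the second identifies the first time that person was seen performing any action on that project. In the end, the data should look like this: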
person_id | project_id | action | time | first_time | first_time_project
------------------------------------------------------------------------
1 | 1 | w | 1 | 1 | 1
1 | 2 | w | 2 | 1 | 2
1 | 3 | w | 2 | 1 | 2
1 | 3 | r | 3 | 1 | 2
1 | 3 | w | 4 | 1 | 2
1 | 4 | w | 4 | 1 | 4
2 | 2 | r | 2 | 2 | 2
2 | 2 | w | 3 | 2 | 2
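For anyone who wants to run the snippets in this thread, the sample above can be rebuilt as a data frame. This is only a sketch: the name `data` follows the loop code in the question, the numeric/character column types are an assumption, and the two `expected_*` vectors are hypothetical helper names holding the values copied from the desired output table.

```r
# Sample data from the question; column types are assumed, not given.
data <- data.frame(
  person_id  = c(1, 1, 1, 1, 1, 1, 2, 2),
  project_id = c(1, 2, 3, 3, 3, 4, 2, 2),
  action     = c("w", "w", "w", "r", "w", "w", "r", "w"),
  time       = c(1, 2, 2, 3, 4, 4, 2, 3),
  stringsAsFactors = FALSE
)

# Desired values of the two new columns, copied from the table above.
expected_first_time         <- c(1, 1, 1, 1, 1, 1, 2, 2)
expected_first_time_project <- c(1, 2, 2, 2, 2, 4, 2, 2)
```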
My naive way of doing this is to write a couple of loops:
for (pid in unique(data$person_id)) {
  # earliest time this person appears at all
  data[data$person_id == pid, "first_time"] = min(data[data$person_id == pid, "time"])
  for (projid in unique(data[data$person_id == pid, "project_id"])) {
    # earliest time this person appears on this project
    data[data$person_id == pid & data$project_id == projid, "first_time_project"] = min(data[data$person_id == pid & data$project_id == projid, "time"])
  }
}
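Now, it doesn't take a genius to see that the doubly nested loop will get very slow. However, I can't figure out a good way to handle this in R. What I'm after is something like SQL's GROUP BY. I know there is probably something that could help here, but I can't figure out how to slice by more than one variable.
Any tips on how to take my code from glacially slow to something faster? At this point I'd be happy with a snail's pace.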
Answer 0: (score: 4)
The combination of Hadley's plyr and transform() is very powerful. If I understand your question correctly, then:
library(plyr)
# foo is the data frame from the question
foo <- ddply(foo, .(person_id), transform, first_time = min(time))
foo <- ddply(foo, .(person_id, project_id), transform,
             first_time_project = min(time))
Answer 1: (score: 4)
Try ave:
transform(data,
first_time = ave(time, person_id, FUN = min),
first_time_project = ave(time, person_id, project_id, drop = TRUE, FUN = min)
)
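As a quick check, the ave() approach can be run against the question's sample data. The sketch below rebuilds that sample (column types are an assumption) and compares the result with the values in the question's desired output. It passes the person/project grouping as an explicit interaction(..., drop = TRUE) factor, which is an alternative spelling chosen here so that unused person/project combinations never reach FUN; it is a correctness check on eight rows, not a performance test.

```r
# Sample data reconstructed from the question's table (types assumed).
data <- data.frame(
  person_id  = c(1, 1, 1, 1, 1, 1, 2, 2),
  project_id = c(1, 2, 3, 3, 3, 4, 2, 2),
  action     = c("w", "w", "w", "r", "w", "w", "r", "w"),
  time       = c(1, 2, 2, 3, 4, 4, 2, 3),
  stringsAsFactors = FALSE
)

result <- transform(
  data,
  # minimum time within each person
  first_time = ave(time, person_id, FUN = min),
  # minimum time within each observed person/project pair
  first_time_project = ave(time,
                           interaction(person_id, project_id, drop = TRUE),
                           FUN = min)
)

# Compare against the desired output given in the question.
stopifnot(all(result$first_time == c(1, 1, 1, 1, 1, 1, 2, 2)))
stopifnot(all(result$first_time_project == c(1, 2, 2, 2, 2, 4, 2, 2)))
print(result)
```

Because ave() returns a vector of the same length and order as its input, the new columns line up with the original rows without any merge step.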
Answer 2: (score: 3)
If you are looking for speed, then data.table is the way to go.
library(data.table)
DT <- data.table(foo)
DT[, first_time := min(time), by = person_id]
DT[, first_time_project := min(time), by = list(person_id, project_id)]
Answer 3: (score: 1)
A quick and dirty solution without loops:
library(plyr)
# function to get first time by any person/project
fp <- function(dat)
{
  dat$first_time <- min(dat$time)
  ftp <- function(d) { d$first_time_project <- min(d$time); return(d) }
  dat <- ddply(dat, .(project_id), ftp)
  return(dat)
}
# this single call should give you the result you want
result <- ddply(data, .(person_id), fp)
Answer 4: (score: 0)
A quick way I can think of:
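foo <- data.frame(
  person_id = rep(1:5, each = 6),
  project_id = sample(1:5, 30, TRUE),
  time = sample(1:30))
# first time each person is seen
first_time <- aggregate(foo$time, list(foo$person_id), min)
foo$first_time <- first_time[match(foo$person_id, first_time[, 1]), 2]
# first time each person is seen on each project
first_proj <- aggregate(foo$time, list(foo$person_id, foo$project_id), min)
foo$first_time_project <- first_proj[match(paste(foo$person_id, foo$project_id),
                                           paste(first_proj[, 1], first_proj[, 2])), 3]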