I have the following problem. I have this table:
> data
   StartPoint EndPoint timeDiff
1         A91    TX043      258
2         A91    TX048      547
3         A92    TX088      330
4         A91    TX088      289
5         A91    TX043      387
6         A92    TX088      241
7         A91    TX088      213
8         A92    TX043      295
9         A91    TX088      518
10        A92    TX088      414
I need an aggregation of the following form:
StartPoint EndPoint count mean(timeDiff)
A91        TX088    3     mean of 289, 213 and 518
A91        TX043    2     mean of 258 and 387
A91        TX048    1     547
A92        TX088    3     mean of 330, 241 and 414
A92        TX043    1     295
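count is the number of occurrences of the same StartPoint and EndPoint pair, and mean is the average timeDiff over the entries with that pair. The result should be sorted by StartPoint, count and EndPoint.
Any help is much appreciated.
Thanks in advance, 杉
My data: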
data <- structure(list(StartPoint = c("A91", "A91", "A92", "A91", "A91", "A92", "A91", "A92", "A91", "A92"), EndPoint = c("TX043", "TX048", "TX088", "TX088", "TX043", "TX088", "TX088", "TX043", "TX088", "TX088"), timeDiff = c(258, 547, 330, 289, 387, 241, 213, 295, 518, 414)), .Names = c("StartPoint", "EndPoint", "timeDiff"), row.names = c(NA, 10L), class = "data.frame")
Answer 0 (score: 4)
You can do this with the base function aggregate:
aggregate(timeDiff~StartPoint+EndPoint,data,function(x) cbind(length(x),mean(x)))
  StartPoint EndPoint timeDiff.1 timeDiff.2
1        A91    TX043     2.0000   322.5000
2        A92    TX043     1.0000   295.0000
3        A91    TX048     1.0000   547.0000
4        A91    TX088     3.0000   340.0000
5        A92    TX088     3.0000   328.3333
But ddply from the plyr package may give a more satisfying result:
library(plyr)
ddply(data,c(.(StartPoint),.(EndPoint)),summarise,count=length(timeDiff),mean=mean(timeDiff))
  StartPoint EndPoint count     mean
1        A91    TX043     2 322.5000
2        A91    TX048     1 547.0000
3        A91    TX088     3 340.0000
4        A92    TX043     1 295.0000
5        A92    TX088     3 328.3333
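Neither result above is sorted the way the question asks. A minimal base R sketch of that last step follows, assuming count should be descending within each StartPoint (inferred from the desired output in the question, not stated there). The name res is only illustrative, and the data frame is rebuilt so the snippet runs on its own.

```r
# Same data as in the question, rebuilt so the snippet is self-contained
data <- data.frame(
  StartPoint = c("A91", "A91", "A92", "A91", "A91", "A92", "A91", "A92", "A91", "A92"),
  EndPoint   = c("TX043", "TX048", "TX088", "TX088", "TX043", "TX088", "TX088", "TX043", "TX088", "TX088"),
  timeDiff   = c(258, 547, 330, 289, 387, 241, 213, 295, 518, 414),
  stringsAsFactors = FALSE
)

# aggregate() stores both statistics in a single matrix column named timeDiff
res <- aggregate(timeDiff ~ StartPoint + EndPoint, data,
                 function(x) c(count = length(x), mean = mean(x)))

# Flatten the matrix column into two ordinary columns and give them plain names
res <- do.call(data.frame, res)
names(res) <- c("StartPoint", "EndPoint", "count", "mean")

# Sort by StartPoint, then count (descending, an assumption), then EndPoint
res <- res[order(res$StartPoint, -res$count, res$EndPoint), ]
rownames(res) <- NULL
print(res)
```

The do.call(data.frame, ...) step is what turns timeDiff.1 and timeDiff.2 style matrix columns into regular columns that order() can reference by name.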
Answer 1 (score: 3)
You can use, for example, data.table:
library(data.table)
data <- data.table(data)
data[, list(count=length(timeDiff), mean=mean(timeDiff)), by=c("StartPoint", "EndPoint")]
   StartPoint EndPoint count     mean
1:        A91    TX043     2 322.5000
2:        A91    TX048     1 547.0000
3:        A92    TX088     3 328.3333
4:        A91    TX088     3 340.0000
5:        A92    TX043     1 295.0000
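The rows above come back in order of first appearance, not in the order the question asks for. One possible way to add the sort is to chain an order() call onto the grouped result. This is a sketch that assumes data.table is installed and that count should be descending (inferred from the desired output). It uses .N in place of length(timeDiff), which gives the same group size, and the names dt and sorted are only illustrative.

```r
library(data.table)

# Same data as in the question, built directly as a data.table
dt <- data.table(
  StartPoint = c("A91", "A91", "A92", "A91", "A91", "A92", "A91", "A92", "A91", "A92"),
  EndPoint   = c("TX043", "TX048", "TX088", "TX088", "TX043", "TX088", "TX088", "TX043", "TX088", "TX088"),
  timeDiff   = c(258, 547, 330, 289, 387, 241, 213, 295, 518, 414)
)

# Group by the pair, then chain a second [ ] to sort the aggregated rows:
# StartPoint ascending, count descending (an assumption), EndPoint ascending
sorted <- dt[, .(count = .N, mean = mean(timeDiff)),
             by = .(StartPoint, EndPoint)][order(StartPoint, -count, EndPoint)]
print(sorted)
```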