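I have data stored in long format. I want to identify replicates by id and average them, that is, average the values recorded under those ids at each datetime, giving one averaged row per datetime. Here, I have two replicates. The example data: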
datetime <- c("2016-04-03 00:00:00", "2016-04-03 00:15:00", "2016-04-03 00:30:00",
"2016-04-03 00:00:00", "2016-04-03 00:15:00", "2016-04-03 00:30:00")
temp <- c(46.7, 47.8, 46.2, 72.0, 71.1, 70.6)
humid <- c(63.8, 60.3, 64.4, 24.1, 24.2, 24.3)
id <- c("RSOSW_16", "RSOSW_16", "RSOSW_16", "RSOSE_16", "RSOSE_16", "RSOSE_16")
df <- data.frame(datetime,temp,humid,id)
I would also like to keep several other columns of information in the data frame. Thanks in advance.
Answer 0 (score: 3)
If you want to ignore the ID (this matches the expected result you gave):
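# cbind() on mixed vectors builds a character matrix, so every column arrives as text
df <- as.data.frame(cbind(datetime,temp,humid,id))
# convert the measurement columns back to numeric (as.character() guards against factors in R < 4.0)
df$temp <- as.numeric(as.character(df$temp))
df$humid <- as.numeric(as.character(df$humid))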
aggregate(df[,colnames(df) %in% c("temp", "humid")],by=list(df$datetime),mean)
              Group.1  temp humid
1 2016-04-03 00:00:00 59.35 43.95
2 2016-04-03 00:15:00 59.45 42.25
3 2016-04-03 00:30:00 58.40 44.35
If you want to distinguish by ID (this will not match the expected result you gave, because that result does not separate RSOSW_16 and RSOSE_16):
aggregate(df[,colnames(df) %in% c("temp", "humid")],by=list(df$id, df$datetime),mean)
   Group.1             Group.2 temp humid
1 RSOSE_16 2016-04-03 00:00:00 72.0  24.1
2 RSOSW_16 2016-04-03 00:00:00 46.7  63.8
3 RSOSE_16 2016-04-03 00:15:00 71.1  24.2
4 RSOSW_16 2016-04-03 00:15:00 47.8  60.3
5 RSOSE_16 2016-04-03 00:30:00 70.6  24.3
6 RSOSW_16 2016-04-03 00:30:00 46.2  64.4
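The default Group.1 and Group.2 headers can be awkward to work with afterwards. As a small sketch (rebuilding the question's data so it runs on its own; the result name `by_id` is just an illustrative choice), naming the elements of the `by` list should carry those names into the result:

```r
# Rebuild the question's example data so this block is self-contained.
datetime <- c("2016-04-03 00:00:00", "2016-04-03 00:15:00", "2016-04-03 00:30:00",
              "2016-04-03 00:00:00", "2016-04-03 00:15:00", "2016-04-03 00:30:00")
temp  <- c(46.7, 47.8, 46.2, 72.0, 71.1, 70.6)
humid <- c(63.8, 60.3, 64.4, 24.1, 24.2, 24.3)
id    <- c("RSOSW_16", "RSOSW_16", "RSOSW_16", "RSOSE_16", "RSOSE_16", "RSOSE_16")
df <- data.frame(datetime, temp, humid, id)

# Named list elements become the names of the grouping columns.
by_id <- aggregate(df[, c("temp", "humid")],
                   by = list(id = df$id, datetime = df$datetime),
                   FUN = mean)
names(by_id)
# "id" "datetime" "temp" "humid"
```

The values are the same as in the output above; only the column headers differ.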
Following @thelatemail's comment, the formula interface may be tidier:
aggregate(cbind(temp,humid) ~ datetime, data=df, mean)
or simply
aggregate(df[,c("temp", "humid")], df["datetime"], mean)
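The question also mentions keeping other columns. One possible approach, sketched here under the assumption that such a column is constant within each datetime, is to add it to the grouping side of the formula. The `station` column below is hypothetical and not part of the original data:

```r
# Rebuild the question's example data so this block is self-contained.
datetime <- c("2016-04-03 00:00:00", "2016-04-03 00:15:00", "2016-04-03 00:30:00",
              "2016-04-03 00:00:00", "2016-04-03 00:15:00", "2016-04-03 00:30:00")
temp  <- c(46.7, 47.8, 46.2, 72.0, 71.1, 70.6)
humid <- c(63.8, 60.3, 64.4, 24.1, 24.2, 24.3)
id    <- c("RSOSW_16", "RSOSW_16", "RSOSW_16", "RSOSE_16", "RSOSE_16", "RSOSE_16")
df <- data.frame(datetime, temp, humid, id)

# Hypothetical extra column, assumed identical for every row of a given datetime.
df$station <- "RSOS"

# Grouping by datetime and station keeps station in the output
# while still averaging the two replicates together.
avg <- aggregate(cbind(temp, humid) ~ datetime + station, data = df, FUN = mean)
avg
```

A column that differs between the replicates (such as id itself) cannot be kept this way without splitting the groups again, so it would need its own summary rule.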
Answer 1 (score: 1)
Using the data.table package, we can compute the averages grouped by datetime and id:
library(data.table)
dt <- data.table(datetime, temp, humid, id)
dt[, .(temp = mean(temp), humid=mean(humid)), by=.(datetime,id)]
# datetime id temp humid
#1: 2016-04-03 00:00:00 RSOSW_16 46.7 63.8
#2: 2016-04-03 00:15:00 RSOSW_16 47.8 60.3
#3: 2016-04-03 00:30:00 RSOSW_16 46.2 64.4
#4: 2016-04-03 00:00:00 RSOSE_16 72.0 24.1
#5: 2016-04-03 00:15:00 RSOSE_16 71.1 24.2
#6: 2016-04-03 00:30:00 RSOSE_16 70.6 24.3
Or, grouping only by datetime, we get
dt[, .(temp = mean(temp), humid=mean(humid)), by=.(datetime)]
# datetime temp humid
#1: 2016-04-03 00:00:00 59.35 43.95
#2: 2016-04-03 00:15:00 59.45 42.25
#3: 2016-04-03 00:30:00 58.40 44.35
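When there are many measurement columns, listing each mean by hand gets tedious. A minimal sketch of the same datetime grouping using `.SD` and `.SDcols` (the data are rebuilt so the block runs on its own, and `avg` is just an illustrative name):

```r
library(data.table)

# Rebuild the question's example data so this block is self-contained.
datetime <- c("2016-04-03 00:00:00", "2016-04-03 00:15:00", "2016-04-03 00:30:00",
              "2016-04-03 00:00:00", "2016-04-03 00:15:00", "2016-04-03 00:30:00")
temp  <- c(46.7, 47.8, 46.2, 72.0, 71.1, 70.6)
humid <- c(63.8, 60.3, 64.4, 24.1, 24.2, 24.3)
id    <- c("RSOSW_16", "RSOSW_16", "RSOSW_16", "RSOSE_16", "RSOSE_16", "RSOSE_16")
dt <- data.table(datetime, temp, humid, id)

# .SDcols selects the columns to summarise; lapply(.SD, mean) averages each one per group.
avg <- dt[, lapply(.SD, mean), by = datetime, .SDcols = c("temp", "humid")]
avg
```

Adding another numeric column to `.SDcols` is then enough to include it in the averages.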
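Answer 2 (score: 0)
We can use summarise_each from dplyr (summarise_each and funs were later deprecated in favour of summarise with across):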
library(dplyr)
as.data.frame(df, stringsAsFactors=FALSE) %>%
group_by(datetime) %>%
summarise_each(funs(mean(as.numeric(.))), temp, humid)
# datetime temp humid
# <chr> <dbl> <dbl>
#1 2016-04-03 00:00:00 59.35 43.95
#2 2016-04-03 00:15:00 59.45 42.25
#3 2016-04-03 00:30:00 58.40 44.35
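For newer dplyr releases, a rough equivalent can be written with `across()`. This is a sketch assuming dplyr 1.0.0 or later and numeric temp and humid columns, as in the question's original data.frame() call:

```r
library(dplyr)

# Rebuild the question's example data so this block is self-contained.
datetime <- c("2016-04-03 00:00:00", "2016-04-03 00:15:00", "2016-04-03 00:30:00",
              "2016-04-03 00:00:00", "2016-04-03 00:15:00", "2016-04-03 00:30:00")
temp  <- c(46.7, 47.8, 46.2, 72.0, 71.1, 70.6)
humid <- c(63.8, 60.3, 64.4, 24.1, 24.2, 24.3)
id    <- c("RSOSW_16", "RSOSW_16", "RSOSW_16", "RSOSE_16", "RSOSE_16", "RSOSE_16")
df <- data.frame(datetime, temp, humid, id)

# across() applies mean to each selected column within every datetime group.
avg <- df %>%
  group_by(datetime) %>%
  summarise(across(c(temp, humid), mean))
avg
```

Because the columns are already numeric here, the as.numeric() conversion used above is not needed.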