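I am using ddply to compute the RMSE and other summary statistics for every (id, criteria) combination in a large data frame. The structure of the data frame is: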
'data.frame': 107955 obs. of 11 variables:
$ date : Factor w/ 1077 levels "2012-08-17","2012-08-18",..: 487 488 489 490 491 492 493 494 495 496 ...
$ value : num
$ mean : num
$ accuracy : num
$ id : int
$ criteria : Factor w/ 5 levels
I tried the following:
ddply(foo, .(id, criteria), summarize, mean=mean(accuracy, na.rm=T), median=median(accuracy, na.rm=T), rmse=sqrt(sum((mean - value)^2 , na.rm = TRUE ) / nrow(foo)))
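Here nrow(foo) gives the number of rows of the whole data frame, not the number of rows in the (id, criteria) slice.
I also tried nrow(.(id, criteria)), which is obviously wrong.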
Sample data: http://pastebin.com/8m0vD5Bq
ddply(foo, .(id, criteria), summarize, mean=mean(accuracy, na.rm=T), median=median(accuracy, na.rm=T), rmse=sqrt(sum((mean - value)^2 , na.rm = TRUE ) / n()))
id criteria mean median rmse
1 49 g 123.00 123.0 101.00
2 49 h 115.25 72.0 80.31
3 49 I 196.00 110.0 173.75
4 50 f 191.75 204.5 168.59
5 50 g 649.00 275.0 634.92
6 51 d 180.00 180.0 160.00
7 51 e 378.67 137.5 359.19
8 51 f 247.00 247.0 227.08
9 52 a 109.00 107.0 74.18
10 52 b 76.33 45.0 46.31
11 52 d 98.67 100.0 64.56
Computing the RMSE by hand for id = 50 and criteria = 'g':
sub_foo <- foo[foo$id == 50 & foo$criteria=='g',]
R> sub_foo
date value mean accuracy id criteria
23 2014-01-08 2 37 1850 50 g
24 2014-01-09 12 33 275 50 g
25 2014-01-10 19 48 253 50 g
26 2014-01-11 35 35 100 50 g
27 2014-01-12 3 23 767 50 g
R> sqrt(sum((sub_foo$mean -sub_foo$value)^2 , na.rm = TRUE ) / nrow(sub_foo))
[1] 24.11
The expected RMSE is 24.11, not the 634.92 that I get from ddply, which is wrong.
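Written out, the per-group RMSE is the following, where n is the number of rows in the (id, criteria) slice. The symbols n, m_i and v_i are notation introduced here for the group size and the mean and value columns; the numbers are the five rows of sub_foo shown above.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(m_i - v_i\right)^2}
             = \sqrt{\frac{35^2 + 21^2 + 29^2 + 0^2 + 20^2}{5}}
             = \sqrt{\frac{2907}{5}} \approx 24.11
```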
Edit: added the dput output of the data frame
R>dput(foo)
structure(list(date = structure(1:36, .Label = c("2013-12-17",
"2013-12-18", "2013-12-19", "2013-12-20", "2013-12-21", "2013-12-22",
"2013-12-23", "2013-12-24", "2013-12-25", "2013-12-26", "2013-12-27",
"2013-12-28", "2013-12-29", "2013-12-30", "2013-12-31", "2014-01-01",
"2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-06",
"2014-01-07", "2014-01-08", "2014-01-09", "2014-01-10", "2014-01-11",
"2014-01-12", "2014-01-13", "2014-01-14", "2014-01-15", "2014-01-16",
"2014-01-17", "2014-01-18", "2014-01-19", "2014-01-20", "2014-01-21"
), class = "factor"), value = c(33L, 30L, 42L, 15L, 36L, 44L,
31L, 30L, 42L, 20L, 25L, 9L, 25L, 17L, 3L, 39L, 14L, 26L, 14L,
41L, 23L, 16L, 2L, 12L, 19L, 35L, 3L, 22L, 8L, 50L, 48L, 41L,
30L, 40L, 6L, 15L), mean = c(33L, 36L, 45L, 25L, 6L, 20L, 34L,
30L, 36L, 36L, 19L, 49L, 11L, 32L, 40L, 34L, 47L, 41L, 45L, 15L,
25L, 48L, 37L, 33L, 48L, 35L, 23L, 27L, 24L, 28L, 42L, 7L, 14L,
37L, 31L, 19L), accuracy = c(100L, 120L, 107L, 167L, 17L, 45L,
110L, 100L, 86L, 180L, 76L, 544L, 44L, 188L, 1333L, 87L, 336L,
158L, 321L, 37L, 109L, 300L, 1850L, 275L, 253L, 100L, 767L, 123L,
300L, 56L, 88L, 17L, 47L, 93L, 517L, 127L), id = c(52L, 52L,
52L, 52L, 52L, 52L, 52L, 52L, 52L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 49L,
49L, 49L, 49L, 49L, 49L, 49L, 49L, 49L), criteria = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L,
5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 8L,
8L, 8L, 8L), .Label = c("a", "b", "d", "e", "f", "g", "h", "I"
), class = "factor")), .Names = c("date", "value", "mean", "accuracy",
"id", "criteria"), class = "data.frame", row.names = c(NA, -36L
))
Answer 0 (score: 0)
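The solution that worked for me was to pass a custom function instead of using summarize. Inside the function I can call nrow() to get the number of rows in the slice.
Solution: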
metrics <- ddply(foo, c("id", "criteria"), function(df) data.frame(
  mean = mean(df$accuracy, na.rm = TRUE),
  median = median(df$accuracy, na.rm = TRUE),
  rmse = sqrt(sum((df$mean - df$value)^2, na.rm = TRUE) / nrow(df))))
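It may help to see where 634.92 comes from. plyr's summarize evaluates its arguments in order, and later ones can see earlier ones, so by the time rmse is evaluated the name mean refers to the freshly computed mean(accuracy) rather than the mean column. The custom function avoids this because df$mean always refers to the original column. Below is a minimal base R sketch that reproduces both numbers for the id = 50, criteria = 'g' slice; the names rmse_column, mean_accuracy and rmse_shadowed are made up for illustration.

```r
# The five rows for id = 50, criteria = 'g', copied from the question.
sub_foo <- data.frame(
  value    = c(2, 12, 19, 35, 3),
  mean     = c(37, 33, 48, 35, 23),
  accuracy = c(1850, 275, 253, 100, 767)
)

# Intended RMSE: uses the per-row `mean` column of the slice.
rmse_column <- sqrt(sum((sub_foo$mean - sub_foo$value)^2, na.rm = TRUE) / nrow(sub_foo))

# What the summarize call effectively computed: `mean` had already been
# replaced by the scalar mean(accuracy) defined just before rmse.
mean_accuracy <- mean(sub_foo$accuracy, na.rm = TRUE)
rmse_shadowed <- sqrt(sum((mean_accuracy - sub_foo$value)^2, na.rm = TRUE) / nrow(sub_foo))

print(round(c(rmse_column, rmse_shadowed), 2))
# 24.11 634.92
```

If you would rather keep summarize, giving the summary column a name that does not clash with an input column (for example mean_accuracy) and dividing by length(value) should presumably give the same per-group result, though I have only checked the function-based version above.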
Thanks for the pointers.