data.table
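I ran into this problem, and it has been driving me crazy lately. It looks like a bug, but I may be missing something obvious here.
I have the following data frame: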
library(data.table)
# First some data
data <- data.table(structure(list(
month = structure(c(1356998400, 1356998400, 1356998400,
1359676800, 1354320000, 1359676800, 1359676800, 1356998400, 1356998400,
1354320000, 1354320000, 1354320000, 1359676800, 1359676800, 1359676800,
1356998400, 1359676800, 1359676800, 1356998400, 1359676800, 1359676800,
1359676800, 1359676800, 1354320000, 1354320000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"),
portal = c(TRUE, TRUE, FALSE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
),
satisfaction = c(10L, 10L, 10L, 9L, 10L, 10L, 9L, 10L, 10L,
9L, 2L, 8L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 10L, 9L, 10L, 9L,
10L, 10L)),
.Names = c("month", "portal", "satisfaction"),
row.names = c(NA, -25L), class = "data.frame"))
I want to summarize it by portal and month. Summarizing with good old tapply works as expected: I get a 3x2 matrix with results for December 2012 and January to February 2013:
> tapply(data$satisfaction, list(data$month, data$portal), mean)
FALSE TRUE
2012-12-01 8.5 8.000000
2013-01-01 10.0 10.000000
2013-02-01 9.0 9.545455
Summarizing with the by argument of data.table does not:
> data[, mean(satisfaction), by = 'month,portal']
month portal V1
1: 2013-01-01 FALSE 10.000000
2: 2013-02-01 TRUE 9.000000
3: 2013-01-01 TRUE 10.000000
4: 2012-12-01 FALSE 8.500000
5: 2012-12-01 TRUE 7.333333
6: 2013-02-01 TRUE 9.666667
7: 2013-02-01 FALSE 9.000000
8: 2012-12-01 TRUE 10.000000
As you can see, it returns a data table with 8 values instead of 6; for example, the portal == TRUE groups for 2012-12-01 and 2013-02-01 each appear twice.
Interestingly, if I restrict this to 2013 data only, everything works as expected.
I am confused beyond belief :). Can anyone help me?
Answer 0 (score: 5)
The problem seems to be related to sorting. When I load data and run setkey:
setkey(data, "month", "portal")
# > data
# month portal satisfaction
# 1: 2012-12-01 TRUE 10
# 2: 2012-12-01 FALSE 9
# 3: 2012-12-01 FALSE 8
# 4: 2012-12-01 TRUE 2
# 5: 2012-12-01 TRUE 10
# 6: 2012-12-01 TRUE 10
# 7: 2013-01-01 TRUE 10
# 8: 2013-01-01 TRUE 10
# 9: 2013-01-01 TRUE 10
# 10: 2013-01-01 TRUE 10
# 11: 2013-01-01 TRUE 10
# 12: 2013-01-01 TRUE 10
# 13: 2013-01-01 FALSE 10
# 14: 2013-02-01 TRUE 9
# 15: 2013-02-01 TRUE 9
# 16: 2013-02-01 FALSE 9
# 17: 2013-02-01 TRUE 10
# 18: 2013-02-01 TRUE 10
# 19: 2013-02-01 TRUE 10
# 20: 2013-02-01 TRUE 10
# 21: 2013-02-01 TRUE 10
# 22: 2013-02-01 TRUE 9
# 23: 2013-02-01 TRUE 10
# 24: 2013-02-01 TRUE 9
# 25: 2013-02-01 TRUE 9
# month portal satisfaction
You can see that the portal column is not sorted properly. When I run setkey again,
setkey(data, "month", "portal")
# I get this warning message:
Warning message:
In setkeyv(x, cols, verbose = verbose) :
Already keyed by this key but had invalid row order, key rebuilt.
If you didn't go under the hood please let datatable-help know so
the root cause can be fixed.
Now data appears to be sorted correctly by the key columns:
# > data
# month portal satisfaction
# 1: 2012-12-01 FALSE 9
# 2: 2012-12-01 FALSE 8
# 3: 2012-12-01 TRUE 10
# 4: 2012-12-01 TRUE 2
# 5: 2012-12-01 TRUE 10
# 6: 2012-12-01 TRUE 10
# 7: 2013-01-01 FALSE 10
# 8: 2013-01-01 TRUE 10
# 9: 2013-01-01 TRUE 10
# 10: 2013-01-01 TRUE 10
# 11: 2013-01-01 TRUE 10
# 12: 2013-01-01 TRUE 10
# 13: 2013-01-01 TRUE 10
# 14: 2013-02-01 FALSE 9
# 15: 2013-02-01 TRUE 9
# 16: 2013-02-01 TRUE 9
# 17: 2013-02-01 TRUE 10
# 18: 2013-02-01 TRUE 10
# 19: 2013-02-01 TRUE 10
# 20: 2013-02-01 TRUE 10
# 21: 2013-02-01 TRUE 10
# 22: 2013-02-01 TRUE 9
# 23: 2013-02-01 TRUE 10
# 24: 2013-02-01 TRUE 9
# 25: 2013-02-01 TRUE 9
# month portal satisfaction
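Rather than checking the printed rows by eye, the row order of a keyed table can be compared with what base R's order() considers sorted. The following is only a minimal sketch: the table dt and the name key_order_ok are hypothetical and are not the data from the question. In a data.table version that is not affected by this problem, the check should come out TRUE.

```r
library(data.table)

# Hypothetical small table with the same column types as in the question
dt <- data.table(
  month = as.POSIXct(c("2013-02-01", "2012-12-01", "2013-02-01", "2012-12-01"),
                     tz = "UTC"),
  portal = c(TRUE, TRUE, FALSE, FALSE),
  satisfaction = c(9L, 10L, 9L, 8L)
)
setkey(dt, month, portal)

# Row order that base R considers sorted by (month, portal)
expected_order <- order(dt$month, dt$portal)

# TRUE when the keyed table is already in that order
key_order_ok <- identical(expected_order, seq_len(nrow(dt)))
print(key_order_ok)
```

If the check is FALSE, the key is marked as set but the rows are not actually in key order, which is the situation the warning message describes.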
So sorting the POSIXct + logical types seems to be the problem? After the key is rebuilt, grouping gives the expected six rows:
data[, mean(satisfaction), by=list(month, portal)]
# month portal V1
# 1: 2012-12-01 FALSE 8.500000
# 2: 2012-12-01 TRUE 8.000000
# 3: 2013-01-01 FALSE 10.000000
# 4: 2013-01-01 TRUE 10.000000
# 5: 2013-02-01 FALSE 9.000000
# 6: 2013-02-01 TRUE 9.545455
So I think there is a bug.
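To detect this kind of duplication programmatically, one option is to compare the number of rows in the grouped result with the number of distinct (month, portal) combinations in the data. This is only a sketch under assumptions: dt, res, n_groups and dup_index are hypothetical names, the table is made up for illustration, and uniqueN() and the by argument of anyDuplicated() require a reasonably recent data.table.

```r
library(data.table)

# Hypothetical small table, not the data from the question
dt <- data.table(
  month = as.POSIXct(c("2012-12-01", "2012-12-01", "2013-01-01",
                       "2013-01-01", "2012-12-01"), tz = "UTC"),
  portal = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  satisfaction = c(10L, 9L, 10L, 10L, 2L)
)

# Grouped means, one row per (month, portal) group
res <- dt[, .(mean_satisfaction = mean(satisfaction)), by = .(month, portal)]

# Number of distinct (month, portal) combinations present in the data
n_groups <- uniqueN(dt, by = c("month", "portal"))

# 0 means no (month, portal) pair appears more than once in the result
dup_index <- anyDuplicated(res, by = c("month", "portal"))

print(nrow(res) == n_groups)
print(dup_index == 0L)
```

If either check fails, the grouped result contains repeated groups, as in the 8-row output in the question.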