data.table没有两列正确汇总

时间:2013-02-25 21:54:13

标签: r data.table

data.table我遇到了这个问题,最近让我发疯了。它看起来像一个bug但可能是我在这里遗漏了一些明显的东西。

我有以下数据框:

# First some data
data <- data.table(structure(list(
  month = structure(c(1356998400, 1356998400, 1356998400, 
                      1359676800, 1354320000, 1359676800, 1359676800, 1356998400, 1356998400, 
                      1354320000, 1354320000, 1354320000, 1359676800, 1359676800, 1359676800, 
                      1356998400, 1359676800, 1359676800, 1356998400, 1359676800, 1359676800, 
                      1359676800, 1359676800, 1354320000, 1354320000), class = c("POSIXct", 
                                                                                 "POSIXt"), tzone = "UTC"), 
  portal = c(TRUE, TRUE, FALSE, TRUE, 
             TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, 
             TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
  ), 
  satisfaction = c(10L, 10L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 
                   9L, 2L, 8L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 10L, 9L, 10L, 9L, 
                   10L, 10L)), 
                  .Names = c("month", "portal", "satisfaction"), 
                  row.names = c(NA, -25L), class = "data.frame"))

我想通过portalmonth对其进行总结。总结良好的旧tapply按预期工作 - 我得到3x2矩阵,结果显示2012年12月和2013年1月至2月:

> tapply(data$satisfaction, list(data$month, data$portal), mean)
           FALSE      TRUE
2012-12-01   8.5  8.000000
2013-01-01  10.0 10.000000
2013-02-01   9.0  9.545455

总结by的{​​{1}}参数不会:

data.table

如您所见,它会返回一个数据表,其中包含 8 值,而不是 6 ;例如,> data[, mean(satisfaction), by = 'month,portal'] month portal V1 1: 2013-01-01 FALSE 10.000000 2: 2013-02-01 TRUE 9.000000 3: 2013-01-01 TRUE 10.000000 4: 2012-12-01 FALSE 8.500000 5: 2012-12-01 TRUE 7.333333 6: 2013-02-01 TRUE 9.666667 7: 2013-02-01 FALSE 9.000000 8: 2012-12-01 TRUE 10.000000 portal == TRUE重复的值。

有趣的是,如果我将这仅限于2013年的数据,那么一切都按预期进行:

month == 2012-02-01

我很困惑超越相信:)。有人可以帮帮我吗?

1 个答案:

答案 0 :(得分:5)

问题似乎与排序有关。当我加载data并执行setkey时:

setkey(data, "month", "portal")

# > data
#          month portal satisfaction
#  1: 2012-12-01   TRUE           10
#  2: 2012-12-01  FALSE            9
#  3: 2012-12-01  FALSE            8
#  4: 2012-12-01   TRUE            2
#  5: 2012-12-01   TRUE           10
#  6: 2012-12-01   TRUE           10
#  7: 2013-01-01   TRUE           10
#  8: 2013-01-01   TRUE           10
#  9: 2013-01-01   TRUE           10
# 10: 2013-01-01   TRUE           10
# 11: 2013-01-01   TRUE           10
# 12: 2013-01-01   TRUE           10
# 13: 2013-01-01  FALSE           10
# 14: 2013-02-01   TRUE            9
# 15: 2013-02-01   TRUE            9
# 16: 2013-02-01  FALSE            9
# 17: 2013-02-01   TRUE           10
# 18: 2013-02-01   TRUE           10
# 19: 2013-02-01   TRUE           10
# 20: 2013-02-01   TRUE           10
# 21: 2013-02-01   TRUE           10
# 22: 2013-02-01   TRUE            9
# 23: 2013-02-01   TRUE           10
# 24: 2013-02-01   TRUE            9
# 25: 2013-02-01   TRUE            9
#          month portal satisfaction

您看到portal列未正确排序。当我再次setkey时,

setkey(data, "month", "portal")

# I get this warning message:
Warning message:
In setkeyv(x, cols, verbose = verbose) :
  Already keyed by this key but had invalid row order, key rebuilt. 
  If you didn't go under the hood please let datatable-help know so 
  the root cause can be fixed.

现在,data列似乎按键列正确排序:

# > data
#          month portal satisfaction
#  1: 2012-12-01  FALSE            9
#  2: 2012-12-01  FALSE            8
#  3: 2012-12-01   TRUE           10
#  4: 2012-12-01   TRUE            2
#  5: 2012-12-01   TRUE           10
#  6: 2012-12-01   TRUE           10
#  7: 2013-01-01  FALSE           10
#  8: 2013-01-01   TRUE           10
#  9: 2013-01-01   TRUE           10
# 10: 2013-01-01   TRUE           10
# 11: 2013-01-01   TRUE           10
# 12: 2013-01-01   TRUE           10
# 13: 2013-01-01   TRUE           10
# 14: 2013-02-01  FALSE            9
# 15: 2013-02-01   TRUE            9
# 16: 2013-02-01   TRUE            9
# 17: 2013-02-01   TRUE           10
# 18: 2013-02-01   TRUE           10
# 19: 2013-02-01   TRUE           10
# 20: 2013-02-01   TRUE           10
# 21: 2013-02-01   TRUE           10
# 22: 2013-02-01   TRUE            9
# 23: 2013-02-01   TRUE           10
# 24: 2013-02-01   TRUE            9
# 25: 2013-02-01   TRUE            9
#          month portal satisfaction

因此,对POSIXct + logical类型进行排序似乎是一个问题?

data[, mean(satisfaction), by=list(month, portal)]

#         month portal        V1
# 1: 2012-12-01  FALSE  8.500000
# 2: 2012-12-01   TRUE  8.000000
# 3: 2013-01-01  FALSE 10.000000
# 4: 2013-01-01   TRUE 10.000000
# 5: 2013-02-01  FALSE  9.000000
# 6: 2013-02-01   TRUE  9.545455

因此我认为有一个错误。