I updated data.table from a fairly recent intermediate version (I think it was 1.8.X) to the latest version, 1.9.4, and now I'm getting some unexpected behavior.
library(data.table)
set.seed(12312014)
# a vector of letters a:e, each repeated between 1 and 10 times
type <- unlist(mapply(rep, letters[1:5], round(runif(5, 1, 10), 0)))
# a random vector of 3 categories
category <- sample(c('small', 'med', 'large'), length(type), replace=T)
my_dt <- data.table(type, category)
Say I want the proportion of each category within each type. I used to do it like this:
my_dt[, type_n:=.N, by=type]
my_dt[, .N/type_n, by=.(type, category)][order(type, category)]
What I get with data.table 1.9.4:
# type category V1
# 1: a large 0.2500000
# 2: a large 0.2500000
# 3: a med 0.2500000
# 4: a med 0.2500000
# 5: a small 0.5000000
# 6: a small 0.5000000
# 7: a small 0.5000000
# 8: a small 0.5000000
# 9: b large 0.4285714
# 10: b large 0.4285714
# 11: b large 0.4285714
# 12: b med 0.4285714
# (...and so on, 42 rows long)
But what I used to get, I'm almost certain, was this (simple proportions of category by type):
# type category V1
# 1: a large 0.2500000
# 2: a med 0.2500000
# 3: a small 0.5000000
# 4: b large 0.4285714
# 5: b med 0.4285714
# 6: b small 0.1428571
# 7: c large 0.3000000
# 8: c med 0.1000000
# 9: c small 0.6000000
# 10: d large 0.2222222
# 11: d med 0.6666667
# 12: d small 0.1111111
# 13: e large 0.3750000
# 14: e med 0.3750000
# 15: e small 0.2500000
I can get the desired result with this:
unique(my_dt[, .N/type_n, by=.(type, category)][order(type, category)])
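...but I'm wondering whether there is a preferred way to do this in the new data.table syntax. I know I could also use prop.table, but I want the result in long format.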
prop.table(table(my_dt), margin=1)
# category
# type large med small
# a 0.2500000 0.2500000 0.5000000
# b 0.4285714 0.4285714 0.1428571
# c 0.3000000 0.1000000 0.6000000
# d 0.2222222 0.6666667 0.1111111
# e 0.3750000 0.3750000 0.2500000
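For what it's worth, the wide table can be reshaped to long format in base R. The following is only a rough sketch on a hypothetical toy data frame (`toy` and the column name `prop` are made up for illustration, not taken from the data above); it relies on `as.data.frame()` returning one row per cell when given a table object.

```r
# Hypothetical toy data standing in for my_dt (not the data from the question)
toy <- data.frame(
  type     = c("a", "a", "a", "a", "b", "b"),
  category = c("small", "small", "med", "large", "med", "large")
)

# Row-wise proportions in wide form, same call shape as prop.table above
wide <- prop.table(table(toy), margin = 1)

# as.data.frame() on a table object gives one row per cell (long format);
# responseName replaces the default "Freq" column name
long <- as.data.frame(wide, responseName = "prop")
long <- long[order(long$type, long$category), ]
print(long)
```

Note that this keeps every type/category combination, so pairs that never occur show up with a proportion of 0, which a grouped data.table query would not produce.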
For reference, my sessionInfo() call gives:
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_1.0.0 data.table_1.9.4
loaded via a namespace (and not attached):
[1] chron_2.3-45 colorspace_1.2-4 digest_0.6.4 grid_3.1.1 gtable_0.1.2 labeling_0.2
[7] MASS_7.3-33 munsell_0.4.2 plyr_1.8.1 proto_0.3-10 Rcpp_0.11.2 reshape2_1.4
[13] scales_0.2.4 stringr_0.6.2 tools_3.1.1
Answer 0 (score: 2)
You could try:
my_dt[, .N, by=.(type,category)][, prop:=N/sum(N), by=type][]
type category N prop
1: a small 4 0.5000000
2: a med 2 0.2500000
3: a large 2 0.2500000
4: b med 3 0.4285714
5: b large 3 0.4285714
6: b small 1 0.1428571
7: c large 3 0.3000000
8: c small 6 0.6000000
9: c med 1 0.1000000
10: d med 6 0.6666667
11: d large 2 0.2222222
12: d small 1 0.1111111
13: e small 2 0.2500000
14: e med 3 0.3750000
15: e large 3 0.3750000
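A likely reason for the repeated rows in the question: inside each group, `type_n` is a vector with one element per row of that group, so `.N/type_n` returns `.N` values rather than one, which matches the duplicated output shown there. The sketch below illustrates this on a hypothetical toy table (`toy`, `dup`, `one`, and `chained` are names made up for this example) and shows that taking `type_n[1]` gives a single value per group. Whether earlier versions behaved differently is not something this sketch demonstrates.

```r
library(data.table)

# Hypothetical toy data (not the data from the question)
toy <- data.table(
  type     = c("a", "a", "a", "a", "b", "b"),
  category = c("small", "small", "med", "large", "med", "large")
)

# Per-type row count, added by reference as in the question
toy[, type_n := .N, by = type]

# type_n has .N elements within each group, so j returns .N values per group
dup <- toy[, .N / type_n, by = .(type, category)]

# Using only the first element returns a single value per group
one <- toy[, .(prop = .N / type_n[1]), by = .(type, category)]

# The chained form from the answer above, for comparison
chained <- toy[, .N, by = .(type, category)][, prop := N / sum(N), by = type]

print(one)
```

The chained form has the advantage of not needing the helper column `type_n` at all.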