New behavior in data.table? .N / `by` (calculating proportions)

Asked: 2014-12-31 03:43:46

Tags: r data.table

I recently updated data.table from an intermediate release (1.8.x, I think) to the latest version, 1.9.4, and I am now getting some unexpected behavior.

library(data.table)
set.seed(12312014)

# a vector of letters a:e, each repeated between 1 and 10 times
type <- unlist(mapply(rep, letters[1:5], round(runif(5, 1, 10), 0)))

# a random vector of 3 categories, one per element of type
category <- sample(c('small', 'med', 'large'), length(type), replace=TRUE)
my_dt <- data.table(type, category)

Say I want the proportion of each category within each type. I used to do it like this:

my_dt[, type_n:=.N, by=type]
my_dt[, .N/type_n, by=.(type, category)][order(type, category)]

With data.table 1.9.4 I get:

# type category        V1
# 1:    a    large 0.2500000
# 2:    a    large 0.2500000
# 3:    a      med 0.2500000
# 4:    a      med 0.2500000
# 5:    a    small 0.5000000
# 6:    a    small 0.5000000
# 7:    a    small 0.5000000
# 8:    a    small 0.5000000
# 9:    b    large 0.4285714
# 10:    b    large 0.4285714
# 11:    b    large 0.4285714
# 12:    b      med 0.4285714
# (...and so on, 42 rows long)

But what I used to get, I am almost certain, was this (the simple proportion of each category by type):

# type category        V1
# 1:    a    large 0.2500000
# 2:    a      med 0.2500000
# 3:    a    small 0.5000000
# 4:    b    large 0.4285714
# 5:    b      med 0.4285714
# 6:    b    small 0.1428571
# 7:    c    large 0.3000000
# 8:    c      med 0.1000000
# 9:    c    small 0.6000000
# 10:    d    large 0.2222222
# 11:    d      med 0.6666667
# 12:    d    small 0.1111111
# 13:    e    large 0.3750000
# 14:    e      med 0.3750000
# 15:    e    small 0.2500000

I can get the desired result with this:

unique(my_dt[, .N/type_n, by=.(type, category)][order(type, category)])

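...but I would like to know whether there is a preferred way to do this in the new data.table syntax. I know I could also use prop.table, but I want the result in long format.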

prop.table(table(my_dt), margin=1)
# category
# type     large       med     small
#    a 0.2500000 0.2500000 0.5000000
#    b 0.4285714 0.4285714 0.1428571
#    c 0.3000000 0.1000000 0.6000000
#    d 0.2222222 0.6666667 0.1111111
#    e 0.3750000 0.3750000 0.2500000
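
As a side note, the wide prop.table result can itself be reshaped to long format. The following is only a minimal sketch on a small made-up data set (`toy_dt` is hypothetical, not the random data above). It relies on base R's `as.data.frame()` method for tables, which returns one row per cell.

```r
library(data.table)

# Small hypothetical data set (not the question's random data)
toy_dt <- data.table(
  type     = c("a", "a", "a", "a", "b", "b", "b"),
  category = c("small", "small", "med", "large", "large", "large", "med")
)

# Wide table of within-type proportions, as in the question
wide <- prop.table(table(toy_dt), margin = 1)

# as.data.frame() on a table gives one row per cell (long format);
# responseName sets the name of the value column
long <- as.data.table(
  as.data.frame(wide, responseName = "prop", stringsAsFactors = FALSE)
)
long[order(type, category)]
```

Unlike a grouped count, this keeps type/category combinations that never occur (here `b`/`small`) with a proportion of 0, which may or may not be what you want.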

For reference, my sessionInfo() call gives:

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_1.0.0    data.table_1.9.4

loaded via a namespace (and not attached):
 [1] chron_2.3-45     colorspace_1.2-4 digest_0.6.4     grid_3.1.1       gtable_0.1.2     labeling_0.2    
 [7] MASS_7.3-33      munsell_0.4.2    plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2      reshape2_1.4    
[13] scales_0.2.4     stringr_0.6.2    tools_3.1.1    

1 answer:

Answer 0 (score: 2)

You could try:

my_dt[, .N, by=.(type,category)][, prop:=N/sum(N), by=type][]

    type category N      prop
 1:    a    small 4 0.5000000
 2:    a      med 2 0.2500000
 3:    a    large 2 0.2500000
 4:    b      med 3 0.4285714
 5:    b    large 3 0.4285714
 6:    b    small 1 0.1428571
 7:    c    large 3 0.3000000
 8:    c    small 6 0.6000000
 9:    c      med 1 0.1000000
10:    d      med 6 0.6666667
11:    d    large 2 0.2222222
12:    d    small 1 0.1111111
13:    e    small 2 0.2500000
14:    e      med 3 0.3750000
15:    e    large 3 0.3750000
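
A likely reason for the repeated rows in the question is that `type_n` is a column, so inside each (type, category) group it is a vector of length `.N`, and `.N/type_n` therefore returns `.N` values per group. The sketch below illustrates this on a small made-up data set (`toy_dt` and the result names are hypothetical) and checks that taking `type_n[1]` gives the same proportions as the count-then-divide approach above. It has been reasoned through against current data.table behavior, not run on 1.9.4.

```r
library(data.table)

# Small hypothetical data set (not the question's random data)
toy_dt <- data.table(
  type     = c("a", "a", "a", "a", "b", "b", "b"),
  category = c("small", "small", "med", "large", "large", "large", "med")
)

# The question's approach: type_n has one value per row, so .N / type_n
# yields a vector of length .N within each (type, category) group
toy_dt[, type_n := .N, by = type]
repeated <- toy_dt[, .N / type_n, by = .(type, category)]
nrow(repeated)  # one row per original row, not one per group

# Using only the first element of type_n gives a single value per group
one_per_group <- toy_dt[, .(prop = .N / type_n[1]), by = .(type, category)]

# The approach above: count per (type, category), then divide by the type total
counted <- toy_dt[, .N, by = .(type, category)][, prop := N / sum(N), by = type]

# Sort both results the same way and compare the proportions
setorder(one_per_group, type, category)
setorder(counted, type, category)
all.equal(one_per_group$prop, counted$prop)
```

The count-then-divide form has the advantage of not needing the helper column `type_n` at all.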