data.table按组获取N个最常见的值

时间:2015-09-01 19:01:10

标签: r data.table

我们想说,我想为每个购买类别找到前3个最常出现的邮政编码。在此示例中,类别为 home townhouse condo 。我有交易数据,如:

set.seed(1234)
d <- data.table(purch_id = 1:3e6,
                purch_cat = sample(x = c('home','townhouse','condo'), 
                                   size = 3e6, replace=TRUE),
                purch_zip = formatC( sample(x = 1e4:9e4, size = 3e6, replace=TRUE), 
                                     width = 5, format = "d", flag = "0") )

我知道我可以这样做:

# there has to be a better way...
d[,list(purch_count = length(purch_id)), 
  by=list(purch_cat, purch_zip)][, purch_rank := rank(-purch_count, ties.method='min'), 
                                 by=purch_cat][purch_rank<=3,][order(purch_cat, purch_rank)]

    purch_cat purch_zip purch_count purch_rank
 1:     condo     39169          32          1
 2:     condo     15725          31          2
 3:     condo     75768          30          3
 4:     condo     72023          30          3
 5:      home     71294          30          1
 6:      home     56053          30          1
 7:      home     57971          29          3
 8:      home     77521          29          3
 9:      home     70124          29          3
10:      home     25302          29          3
11:      home     65292          29          3
12:      home     39488          29          3
13: townhouse     39587          33          1
14: townhouse     80365          30          2
15: townhouse     37360          30          2

但这不是最优雅的data.table方法,而且似乎有点慢。

任何减少通行证数量的建议?也许使用table()的东西? TYVM!

2 个答案:

答案 0 :(得分:7)

一种方法是

d[ , .N, by=.(purch_cat, purch_zip)][
  order(-N), 
  .SD[ N >= unique(N)[3] ]
,by=purch_cat]

给出了

    purch_cat purch_zip  N
 1: townhouse     39587 33
 2: townhouse     80365 30
 3: townhouse     37360 30
 4: townhouse     83099 28
 5: townhouse     33518 28
 6: townhouse     59347 28
 7: townhouse     22402 28
 8:     condo     39169 32
 9:     condo     15725 31
10:     condo     75768 30
11:     condo     72023 30
12:      home     71294 30
13:      home     56053 30
14:      home     57971 29
15:      home     77521 29
16:      home     70124 29
17:      home     25302 29
18:      home     65292 29
19:      home     39488 29
20:      home     81754 28
21:      home     43426 28
22:      home     16943 28
23:      home     88978 28
24:      home     43003 28
25:      home     76501 28
    purch_cat purch_zip  N

要实现OP的打破平局规则,可以做到

d[ , .N, by=.(purch_cat,purch_zip)][
  order(-N),
  .SD[ N >= unique(N)[3] ][
      .N - frank(N, ties.method='max') < 3 ]
, by=purch_cat]

给出了

    purch_cat purch_zip  N
 1: townhouse     39587 33
 2: townhouse     80365 30
 3: townhouse     37360 30
 4:     condo     39169 32
 5:     condo     15725 31
 6:     condo     75768 30
 7:     condo     72023 30
 8:      home     71294 30
 9:      home     56053 30
10:      home     57971 29
11:      home     77521 29
12:      home     70124 29
13:      home     25302 29
14:      home     65292 29
15:      home     39488 29

按照@ MichaelChirico的回答,这种方法会增加frank步。

答案 1 :(得分:5)

编辑:改进。

我认为你正好走在正确的轨道上。但是,您遗漏的一个关键因素是函数frank,它已经过优化,可以大大加快您的代码速度(几乎可以立即在您的3米行样本数据上运行):

d[ , .(purch_count = .N), 
  by = .(purch_cat, purch_zip)
  ][, purch_rank := frank(-purch_count, ties.method = 'min'), 
    keyby = purch_cat
    ][purch_rank <= 3,
      ][order(purch_cat, purch_rank)]
    purch_cat purch_zip purch_count purch_rank
 1:     condo     39169          32          1
 2:     condo     15725          31          2
 3:     condo     75768          30          3
 4:     condo     72023          30          3
 5:      home     71294          30          1
 6:      home     56053          30          1
 7:      home     57971          29          3
 8:      home     77521          29          3
 9:      home     70124          29          3
10:      home     25302          29          3
11:      home     65292          29          3
12:      home     39488          29          3
13: townhouse     39587          33          1
14: townhouse     80365          30          2
15: townhouse     37360          30          2

table(慢)的答案不完整:

是的,一种方法涉及使用table

d[ , {x <- table(purch_zip)
x <- x[order(-x)]
names(x[x %in% unique(x)[1:3]])
}, keyby = purch_cat]
    purch_cat    V1
 1:     condo 39169
 2:     condo 15725
 3:     condo 72023
 4:     condo 75768
 5:      home 56053
 6:      home 71294
 7:      home 25302
 8:      home 39488
 9:      home 57971
10:      home 65292
11:      home 70124
12:      home 77521
13:      home 16943
14:      home 43003
15:      home 43426
16:      home 76501
17:      home 81754
18:      home 88978
19: townhouse 39587
20: townhouse 37360
21: townhouse 80365
22: townhouse 22402
23: townhouse 33518
24: townhouse 59347
25: townhouse 83099
    purch_cat    V1