Say I want to find the top 3 most frequently occurring zip codes for each purchase category. In this example the categories are home, townhouse and condo. I have transaction data like:
library(data.table)

set.seed(1234)
d <- data.table(purch_id = 1:3e6,
                purch_cat = sample(x = c('home','townhouse','condo'),
                                   size = 3e6, replace = TRUE),
                purch_zip = formatC(sample(x = 1e4:9e4, size = 3e6, replace = TRUE),
                                    width = 5, format = "d", flag = "0"))
I know I can do this:
# there has to be a better way...
d[, list(purch_count = length(purch_id)),
  by = list(purch_cat, purch_zip)
][, purch_rank := rank(-purch_count, ties.method = 'min'),
  by = purch_cat
][purch_rank <= 3, ][order(purch_cat, purch_rank)]
purch_cat purch_zip purch_count purch_rank
1: condo 39169 32 1
2: condo 15725 31 2
3: condo 75768 30 3
4: condo 72023 30 3
5: home 71294 30 1
6: home 56053 30 1
7: home 57971 29 3
8: home 77521 29 3
9: home 70124 29 3
10: home 25302 29 3
11: home 65292 29 3
12: home 39488 29 3
13: townhouse 39587 33 1
14: townhouse 80365 30 2
15: townhouse 37360 30 2
But this isn't the most elegant data.table approach, and it seems a bit slow. Any suggestions for reducing the number of passes over the data? Maybe something using table()? TYVM!
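A quick way to make "a bit slow" concrete is to wrap the expression in system.time(); exact timings will of course vary by machine, but it gives a baseline to compare the answers against:

# rough timing of the rank()-based approach above; numbers are machine-dependent
system.time(
  d[, list(purch_count = length(purch_id)),
    by = list(purch_cat, purch_zip)
  ][, purch_rank := rank(-purch_count, ties.method = 'min'),
    by = purch_cat
  ][purch_rank <= 3, ][order(purch_cat, purch_rank)]
)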
Answer 0 (score: 7)
One way is
d[ , .N, by = .(purch_cat, purch_zip)][
  order(-N),
  .SD[ N >= unique(N)[3] ],
  by = purch_cat]
which gives
purch_cat purch_zip N
1: townhouse 39587 33
2: townhouse 80365 30
3: townhouse 37360 30
4: townhouse 83099 28
5: townhouse 33518 28
6: townhouse 59347 28
7: townhouse 22402 28
8: condo 39169 32
9: condo 15725 31
10: condo 75768 30
11: condo 72023 30
12: home 71294 30
13: home 56053 30
14: home 57971 29
15: home 77521 29
16: home 70124 29
17: home 25302 29
18: home 65292 29
19: home 39488 29
20: home 81754 28
21: home 43426 28
22: home 16943 28
23: home 88978 28
24: home 43003 28
25: home 76501 28
purch_cat purch_zip N
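The key step is .SD[ N >= unique(N)[3] ]: after order(-N), unique(N) lists each group's distinct counts in descending order, so unique(N)[3] is the third-highest count and every zip with at least that count is kept, which is why ties at the boundary survive. A minimal sketch of the same idea on a plain vector, with made-up counts:

# toy counts, already sorted descending as they are after order(-N)
N <- c(33, 30, 30, 28, 28, 27)
unique(N)          # 33 30 28 27  -> distinct counts, still descending
unique(N)[3]       # 28           -> the third-highest count
N >= unique(N)[3]  # TRUE TRUE TRUE TRUE TRUE FALSE -> rows kept by the filter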
To implement the OP's tie-breaking rule, one can do
d[ , .N, by = .(purch_cat, purch_zip)][
  order(-N),
  .SD[ N >= unique(N)[3] ][
    .N - frank(N, ties.method = 'max') < 3 ],
  by = purch_cat]
which gives
purch_cat purch_zip N
1: townhouse 39587 33
2: townhouse 80365 30
3: townhouse 37360 30
4: condo 39169 32
5: condo 15725 31
6: condo 75768 30
7: condo 72023 30
8: home 71294 30
9: home 56053 30
10: home 57971 29
11: home 77521 29
12: home 70124 29
13: home 25302 29
14: home 65292 29
15: home 39488 29
Following @MichaelChirico's answer, this approach adds the frank step.
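The extra filter works because .N - frank(N, ties.method = 'max') counts how many rows in the group have a strictly larger N, so keeping rows where that count is below 3 reproduces rank(-N, ties.method = 'min') <= 3. A small sketch on a plain vector (toy counts; frank() is from data.table, loaded above):

# counts within one group after the first filter, sorted descending
N <- c(33, 30, 30, 28, 28, 28, 28)
frank(N, ties.method = 'max')                   # 7 6 6 4 4 4 4
length(N) - frank(N, ties.method = 'max')       # 0 1 1 3 3 3 3 -> rows strictly larger
length(N) - frank(N, ties.method = 'max') < 3   # keeps only the top three rows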
Answer 1 (score: 5)
I think you're right on track. However, one key ingredient you're missing is the function frank, which is optimized and speeds your code up considerably (it runs almost instantly on your 3 million-row sample data):
d[ , .(purch_count = .N),
   by = .(purch_cat, purch_zip)
][, purch_rank := frank(-purch_count, ties.method = 'min'),
  keyby = purch_cat
][purch_rank <= 3,
][order(purch_cat, purch_rank)]
purch_cat purch_zip purch_count purch_rank
1: condo 39169 32 1
2: condo 15725 31 2
3: condo 75768 30 3
4: condo 72023 30 3
5: home 71294 30 1
6: home 56053 30 1
7: home 57971 29 3
8: home 77521 29 3
9: home 70124 29 3
10: home 25302 29 3
11: home 65292 29 3
12: home 39488 29 3
13: townhouse 39587 33 1
14: townhouse 80365 30 2
15: townhouse 37360 30 2
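As a rough check of how much of the speed-up comes from frank alone, one can aggregate once and time just the two ranking steps (a sketch; agg is simply a name chosen here for the aggregated counts, and timings depend on the machine):

# aggregate once, then time base rank() vs data.table's frank()
agg <- d[ , .(purch_count = .N), by = .(purch_cat, purch_zip)]
system.time(agg[, rank(-purch_count, ties.method = 'min'), by = purch_cat])
system.time(agg[, frank(-purch_count, ties.method = 'min'), by = purch_cat])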
And an incomplete answer about table (it's slow): yes, one approach does involve using table.
d[ , {x <- table(purch_zip)
      x <- x[order(-x)]
      names(x[x %in% unique(x)[1:3]])
     }, keyby = purch_cat]
purch_cat V1
1: condo 39169
2: condo 15725
3: condo 72023
4: condo 75768
5: home 56053
6: home 71294
7: home 25302
8: home 39488
9: home 57971
10: home 65292
11: home 70124
12: home 77521
13: home 16943
14: home 43003
15: home 43426
16: home 76501
17: home 81754
18: home 88978
19: townhouse 39587
20: townhouse 37360
21: townhouse 80365
22: townhouse 22402
23: townhouse 33518
24: townhouse 59347
25: townhouse 83099
purch_cat V1