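Suppose my data frame has two columns, customer ID and transaction amount. For each unique customer ID, I want to sort the transaction amounts in descending order and then compute the mean of the top three transaction amounts from that sorted list.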
Cust_id trans_amount
12345 100
12345 200
12345 170
12345 300
12345 250
12456 140
12456 240
12456 160
12456 100
The format I am looking for is:
Cust_id trans_amount
12345 300
12345 250
12345 200
12345 170
12345 100
12456 240
12456 160
12456 140
12456 100
and from there the mean of the top three, i.e.
Cust_id mean_for_top_3
12345 250
12456 180
For the intermediate step, I tried:
ddply(cust_data,.(cust_id.),summarize,sorted_amount=sort(trans_amount,,decreasing=TRUE))
but it did not give the result. Please advise how I can get the desired output.
Answer 0 (score: 3)
A solution using data.table:
library(data.table)
setDT(cust_data)
cust_data_sort <- cust_data[, .(trans_amount = sort(trans_amount, decreasing = TRUE)), Cust_id]
cust_data_sort[, .(mean_for_top_3 = mean(head(trans_amount, 3))), Cust_id]
Cust_id mean_for_top_3
1: 12345 250
2: 12456 180
If you do not need the sorted table cust_data_sort, you can compute the mean directly:
cust_data[, .(mean_for_top_3 = mean(head(sort(trans_amount, decreasing = TRUE), 3))), Cust_id]
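For comparison, the same two steps can be sketched in base R without extra packages. This is only an illustrative sketch: `top_n_mean` is a hypothetical helper name, not part of the original answer, and the data frame is rebuilt from the question's sample.

```r
# Sample data from the question
cust_data <- data.frame(
  Cust_id = c(rep(12345, 5), rep(12456, 4)),
  trans_amount = c(100, 200, 170, 300, 250, 140, 240, 160, 100)
)

# Hypothetical helper: mean of the n largest values in x
top_n_mean <- function(x, n = 3) mean(head(sort(x, decreasing = TRUE), n))

# Intermediate table: Cust_id ascending, trans_amount descending within each customer
cust_data_sorted <- cust_data[order(cust_data$Cust_id, -cust_data$trans_amount), ]

# One row per customer with the mean of the top three amounts
result <- aggregate(trans_amount ~ Cust_id, data = cust_data, FUN = top_n_mean)
names(result)[2] <- "mean_for_top_3"
print(result)
```

Because `head()` stops at the end of the vector, a customer with fewer than three transactions simply gets the mean of whatever rows exist.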
Answer 1 (score: 1)
Using dplyr:
df <- read.table(text = "Cust_id trans_amount
12345 100
12345 200
12345 170
12345 300
12345 250
12456 140
12456 240
12456 160
12456 100", header = TRUE)
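library(dplyr)
df %>%
  group_by(Cust_id) %>%
  arrange(desc(trans_amount), .by_group = TRUE) %>%
  # name the weighting column explicitly instead of relying on the last-column default
  top_n(n = 3, wt = trans_amount) %>%
  summarize(mean = mean(trans_amount))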
# A tibble: 2 x 2
Cust_id mean
<int> <dbl>
1 12345 250
2 12456 180
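One caveat worth sketching: `top_n()` keeps every row tied with the third-largest value, so a group can contribute more than three rows to the mean. The question's sample has no ties, so the result above is unaffected. The example below uses made-up data (`df_ties` is hypothetical) and assumes dplyr 1.0.0 or later, where `slice_max()` is available.

```r
library(dplyr)

# Hypothetical data with a tie at the third-largest amount
df_ties <- data.frame(
  Cust_id = rep(1, 4),
  trans_amount = c(300, 200, 100, 100)
)

# top_n() keeps both rows tied for third place, so four rows enter the mean
mean_with_ties <- df_ties %>%
  group_by(Cust_id) %>%
  top_n(n = 3, wt = trans_amount) %>%
  summarize(mean = mean(trans_amount))

# slice_max() with with_ties = FALSE keeps exactly three rows per group
mean_exact <- df_ties %>%
  group_by(Cust_id) %>%
  slice_max(trans_amount, n = 3, with_ties = FALSE) %>%
  summarize(mean = mean(trans_amount))

print(mean_with_ties$mean)  # 175
print(mean_exact$mean)      # 200
```

Which behaviour is wanted depends on whether "top three" means three rows or the three highest ranks; the original answer does not say.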
An alternative that also returns the row count per customer:
> df %>% group_by(Cust_id) %>%
+ #arrange(desc(trans_amount), .by_group = T) %>%
+ mutate(count = n()) %>%
+ top_n(n = 3, wt = trans_amount) %>%
+ mutate(mean = mean(trans_amount)) %>%
+ select(Cust_id,mean,count) %>% distinct()
# A tibble: 2 x 3
# Groups: Cust_id [2]
Cust_id mean count
<int> <dbl> <int>
1 12345 250 5
2 12456 180 4