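Find the maximum value for each grouping variable and convert it into a new variable

Time: 2017-04-16 19:01:09

Tags: r

I have the following dataset. For each customer_id, I want to identify the product with the highest amount and turn it into a new column. I also want to keep only one record per ID.

Code to generate the dataset: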

x <- data.frame(customer_id=c(1,1,1,2,2,2), product=c("a","b","c","a","b","c"), amount=c(50,125,100,75,110,150))

The actual dataset looks like this:

customer_id product amount
          1       a     50
          1       b    125
          1       c    100
          2       a     75
          2       b    110
          2       c    150

The desired output should look like this:

customer_ID product_b product_c
          1       125         0
          2         0       150

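2 answers:

Answer 0: (score: 2)

We can do this with the tidyverse. After grouping by 'customer_id', slice the row with the maximum 'amount', use paste0 to add the prefix 'product_' to the 'product' column (if needed), and spread to wide format.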

library(dplyr)
library(tidyr)
x %>%
   group_by(customer_id) %>% 
   slice(which.max(amount)) %>% 
   mutate(product = paste0("product_", product)) %>%
   spread(product, amount, fill = 0)
#  customer_id product_b product_c
#*       <dbl>     <dbl>     <dbl>
#1           1       125         0
#2           2         0       150
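
In tidyr 1.0.0 and later, spread is superseded by pivot_wider, and dplyr 1.0.0 adds slice_max. The following is a minimal sketch of the same steps with those functions, assuming those package versions are installed. The name `wide` is introduced here for illustration only and is not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question.
x <- data.frame(customer_id = c(1, 1, 1, 2, 2, 2),
                product = c("a", "b", "c", "a", "b", "c"),
                amount = c(50, 125, 100, 75, 110, 150))

wide <- x %>%
  group_by(customer_id) %>%
  # Keep exactly one row per customer: the one with the largest amount.
  slice_max(amount, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # One column per remaining product, prefixed with "product_";
  # customers without a value for a product get 0.
  pivot_wider(names_from = product,
              values_from = amount,
              names_prefix = "product_",
              values_fill = list(amount = 0))

print(wide)
```

Setting with_ties = FALSE keeps a single row per customer even when two products share the maximum amount, which matches the requirement of one record per ID.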

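Another option is to arrange the dataset by 'customer_id' and by 'amount' in descending order, keep the distinct rows based on 'customer_id' (the first row per customer, which has the largest amount), and spread to wide format on 'product'.

arrange(x, customer_id, desc(amount)) %>%
        distinct(customer_id, .keep_all = TRUE) %>% 
        mutate(product = paste0("product_", product)) %>%
        spread(product, amount, fill = 0)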

Answer 1: (score: 1)

Using the reshape2 package:

library(reshape2)

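# ave() returns 1 for rows whose amount equals the per-customer maximum and 0 otherwise;
# !! converts that 0/1 vector to logical for row subsetting (tied maxima are all kept)
x1 <- x[!!with(x, ave(amount, customer_id, FUN = function(i) i == max(i))),]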

dcast(x1, customer_id ~ product, value.var = 'amount', fill = 0)
#  customer_id   b   c
#1           1 125   0
#2           2   0 150
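
For comparison, the same filter-then-widen idea can be sketched in base R without extra packages. This is an illustrative alternative, not part of either answer; the names `top`, `tab`, `m`, and `wide` are hypothetical, and stringsAsFactors = FALSE is set so that the behaviour does not depend on the R version's default.

```r
# Same example data as in the question, with product kept as character.
x <- data.frame(customer_id = c(1, 1, 1, 2, 2, 2),
                product = c("a", "b", "c", "a", "b", "c"),
                amount = c(50, 125, 100, 75, 110, 150),
                stringsAsFactors = FALSE)

# Keep rows whose amount equals the per-customer maximum (ties are kept).
top <- x[x$amount == ave(x$amount, x$customer_id, FUN = max), ]

# Cross-tabulate amount by customer and product; empty cells become 0.
tab <- xtabs(amount ~ customer_id + product, data = top)

# Convert the table to a data frame and prefix the product columns.
m <- as.data.frame.matrix(tab)
names(m) <- paste0("product_", names(m))

# Move the customer ids from the row names into a regular column.
wide <- cbind(customer_id = as.numeric(rownames(m)), m)
rownames(wide) <- NULL

print(wide)
```

Unlike which.max, the equality filter keeps every row that ties for the maximum, so a customer with two equally large products would get non-zero values in two columns.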