Rearranging a data.frame to get the order of products

Asked: 2015-06-23 15:03:25

Tags: r date

I have a data frame in the following format:

df <- data.frame(client = c("client1", "client1", "client2", "client3", "client3"),
                 product = c("A", "B", "A", "D", "A"),
                 purchase_Date = c("2010-03-22", "2010-02-02", "2009-03-02", "2011-04-05", "2012-11-01"))
df$purchase_Date <- as.Date(df$purchase_Date, format = "%Y-%m-%d")

which looks like this:

   client product purchase_Date
1 client1       A    2010-03-22
2 client1       B    2010-02-02
3 client2       A    2009-03-02
4 client3       D    2011-04-05
5 client3       A    2012-11-01

I would like to rearrange it like this:

   client purchase1 purchase2
1 client1         B         A
2 client2         A      <NA>
3 client3         D         A

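So I want to know which product each client bought first, second, third, and so on, ordered by purchase date. I can easily get each of these separately using data.table:
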
library(data.table)
setDT(df)[ , .SD[order(purchase_Date), product][1], by = client]

for the first one. But I don't know how to get the desired output efficiently.

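3 Answers:

Answer 0: (score: 7)

Here is a possible data.table solution. If any client has 10 or more purchases, I'd suggest avoiding paste0 and just using indx := seq_len(.N) instead, because dcast sorts the new column names as text (purchase10 comes before purchase2), which could scramble the purchase order.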

setDT(df)[order(purchase_Date), indx := paste0("purchase", seq_len(.N)), by = client]
dcast(df, client ~ indx, value.var = "product")
#     client purchase1 purchase2
# 1: client1         B         A
# 2: client2         A        NA
# 3: client3         D         A
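
The caveat about paste0 can be seen without data.table at all. The following is a minimal base R sketch, on the assumption that the new columns end up sorted as text in the C locale; the zero-padded labels are a hypothetical workaround, not something the answer proposes.

```r
# Labels as paste0("purchase", seq_len(.N)) would build them for 12 purchases.
nm <- paste0("purchase", 1:12)

# Text sorting in the C locale puts "purchase10" right after "purchase1".
sorted_nm <- sort(nm, method = "radix")
head(sorted_nm, 4)
# "purchase1"  "purchase10" "purchase11" "purchase12"

# Hypothetical workaround: zero-padded labels sort the same way as the numbers.
padded <- sprintf("purchase%02d", 1:12)
identical(sort(padded, method = "radix"), padded)
# TRUE
```

Using a plain integer index, as the answer suggests, avoids the problem entirely because the columns are then ordered numerically.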

A comparison between the order() and frank() methods of creating the indx column:

require(data.table)
set.seed(45L); 
dt = data.table(client = sample(paste("client", 1:1e4, sep=""), 1e6, TRUE))
dt[, `:=`(product = sample(paste("p", 1:200, sep=""), .N, FALSE), 
          purchase_Date = as.Date(sample(14610:16586, .N, FALSE), 
           origin = "1970-01-01")), by=client]

system.time(dt[order(purchase_Date), indx := seq_len(.N), by = client])
# user  system elapsed 
# 0.19    0.02    0.20 
system.time(dt[, purch_rank := frank(purchase_Date, ties.method = "dense"), by=client])
# user  system elapsed 
# 3.94    0.00    3.98 
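
Speed aside, the two methods can give different indices when a client buys two products on the same date. The base R sketch below mimics both numbering schemes on a made-up vector of dates (the dates are an assumption for illustration only): sequential numbering after sorting gives every row its own index, while a dense rank gives tied dates the same index.

```r
# Hypothetical purchase dates for one client, with a tie on 2010-03-22.
d <- as.numeric(as.Date(c("2010-02-02", "2010-03-22", "2010-03-22", "2011-01-05")))

# Sequential numbering in date order (what order() + seq_len(.N) produces):
# ties are broken by row position, so every row gets a distinct index.
seq_idx <- rank(d, ties.method = "first")
seq_idx
# 1 2 3 4

# Dense rank (what ties.method = "dense" produces): tied dates share an index.
dense_idx <- match(d, sort(unique(d)))
dense_idx
# 1 2 2 3
```

With a dense rank, two products of the same client can fall into the same rank column, and as far as I know dcast then falls back to aggregating the cell (by default with length) instead of showing the product.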

Answer 1: (score: 4)

A dplyr/tidyr approach:

library(dplyr)
library(tidyr)

df %>%
  group_by(client) %>%
  mutate(purch_rank = dense_rank(purchase_Date)) %>%
  select(-purchase_Date) %>%
  spread(purch_rank, product)
#Source: local data frame [3 x 3]
#
#   client 1  2
#1 client1 B  A
#2 client2 A NA
#3 client3 D  A

A possible data.table approach:

library(data.table) #v 1.9.5+ currently from GitHub for "frank"
setDT(df)[, purch_rank := frank(purchase_Date, ties.method = "dense"), by=client]
dcast(df, client ~ purch_rank, value.var = "product")
#    client 1  2
#1: client1 B  A
#2: client2 A NA
#3: client3 D  A

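Answer 2: (score: 0)

Here is a solution with dplyr and tidyr. It numbers purchases in row order (seq_along(product)) rather than by purchase date, so client1 comes out as A, B: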

library(dplyr)
library(tidyr)

df %>%
  group_by(client) %>%
  select(-purchase_Date) %>%
  mutate(purchase = seq_along(product)) %>%
  spread(purchase, product)
#Source: local data frame [3 x 3]
#
#   client 1  2
#1 client1 A  B
#2 client2 A NA
#3 client3 D  A

A slightly different approach, with a different output, uses the reshape2 package. Use the previous code, but replace the last line with this one:

dcast(client ~ product)
#Using purchase as value column: use value.var to override.
#   client A  B  D
#1 client1 1  2 NA
#2 client2 1 NA NA
#3 client3 2 NA  1
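
The pipelines in this answer number purchases in the order the rows appear. For comparison, here is a minimal base R sketch that numbers them by purchase date instead and produces the layout asked for in the question. It swaps in order(), ave() and reshape() in place of dplyr/tidyr; the column renaming at the end is only there to match the question's purchase1/purchase2 names.

```r
df <- data.frame(
  client = c("client1", "client1", "client2", "client3", "client3"),
  product = c("A", "B", "A", "D", "A"),
  purchase_Date = as.Date(c("2010-03-22", "2010-02-02", "2009-03-02",
                            "2011-04-05", "2012-11-01")),
  stringsAsFactors = FALSE
)

# Sort by client and date so that row order within a client is purchase order.
df <- df[order(df$client, df$purchase_Date), ]

# Number the purchases within each client (1 = earliest).
df$purchase <- ave(seq_along(df$client), df$client, FUN = seq_along)

# Long to wide: one row per client, one column per purchase number.
wide <- reshape(df[, c("client", "product", "purchase")],
                idvar = "client", timevar = "purchase",
                direction = "wide", sep = "")
rownames(wide) <- NULL

# reshape() names the columns product1, product2; rename to match the question.
names(wide) <- sub("^product", "purchase", names(wide))
wide
#    client purchase1 purchase2
# 1 client1         B         A
# 2 client2         A      <NA>
# 3 client3         D         A
```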