I have a data frame in the following format:
df <- data.frame(client = c("client1", "client1", "client2", "client3", "client3"),
product = c("A", "B", "A", "D", "A"),
purchase_Date = c("2010-03-22", "2010-02-02", "2009-03-02", "2011-04-05", "2012-11-01"))
df$purchase_Date <- as.Date(df$purchase_Date, format = "%Y-%m-%d")
It looks like this:
client product purchase_Date
1 client1 A 2010-03-22
2 client1 B 2010-02-02
3 client2 A 2009-03-02
4 client3 D 2011-04-05
5 client3 A 2012-11-01
I would like to rearrange it like this:
client purchase1 purchase2
1 client1 B A
2 client2 A <NA>
3 client3 D A
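So I want to know which product each client bought first, second, third and so on, ordered by purchase date. With data.table I can easily get each one separately, for example the first purchase:
library(data.table)
setDT(df)[, .SD[order(purchase_Date), product][1], by = client]
But I don't know how to get the desired output efficiently.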
Answer 0 (score: 7)
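Here is a possible data.table solution. If any client has 10 or more purchases, I would suggest avoiding paste0 and using just indx := seq_len(.N) instead, because the pasted names sort as text (purchase10 comes before purchase2) and will probably break the purchase order in the result: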
setDT(df)[order(purchase_Date), indx := paste0("purchase", seq_len(.N)), by = client]
dcast(df, client ~ indx, value.var = "product")
# client purchase1 purchase2
# 1: client1 B A
# 2: client2 A NA
# 3: client3 D A
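The caveat about 10 or more purchases can be sketched as follows. This is only an illustration under assumptions: the table `dt` with a single client and 11 made-up products is hypothetical, and renaming the columns after `dcast` is one possible way to keep the `purchase` prefix, not something the answer itself prescribes.

```r
library(data.table)

# Hypothetical data: one client with 11 purchases on consecutive days
dt <- data.table(client = "client1",
                 product = paste0("p", 1:11),
                 purchase_Date = as.Date("2010-01-01") + 0:10)

# Text labels sort character by character, so "purchase10" lands before "purchase2"
lexical <- sort(paste0("purchase", 1:11), method = "radix")

# An integer index keeps the numeric order when casting to wide format
dt[order(purchase_Date), indx := seq_len(.N), by = client]
wide <- dcast(dt, client ~ indx, value.var = "product")

# Add the prefix only after the columns are already in the right order
setnames(wide, names(wide)[-1], paste0("purchase", names(wide)[-1]))
```

Here `lexical[2]` is `"purchase10"`, while the columns of `wide` run from `purchase1` to `purchase11` in numeric order.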
A comparison between the order() and frank() methods of creating the indx column:
require(data.table)
set.seed(45L);
dt = data.table(client = sample(paste("client", 1:1e4, sep=""), 1e6, TRUE))
dt[, `:=`(product = sample(paste("p", 1:200, sep=""), .N, FALSE),
purchase_Date = as.Date(sample(14610:16586, .N, FALSE),
origin = "1970-01-01")), by=client]
system.time(dt[order(purchase_Date), indx := seq_len(.N), by = client])
# user system elapsed
# 0.19 0.02 0.20
system.time(dt[, purch_rank := frank(purchase_Date, ties.method = "dense"), by=client])
# user system elapsed
# 3.94 0.00 3.98
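Apart from speed, the two methods are not interchangeable when a client has several purchases on the same date. The sketch below uses a small hypothetical table `dt` (not the benchmark data) to show the difference: the order-based index numbers every row, whereas a dense rank gives tied dates the same rank, so a later `dcast` on that rank would find more than one product per cell.

```r
library(data.table)

# Hypothetical data: the first two purchases share the same date
dt <- data.table(client = c("c1", "c1", "c1"),
                 product = c("A", "B", "C"),
                 purchase_Date = as.Date(c("2010-01-01", "2010-01-01", "2010-02-01")))

# Order-based index: one distinct number per row within each client
dt[order(purchase_Date), indx := seq_len(.N), by = client]

# Dense rank: tied dates receive the same rank
dt[, purch_rank := frank(purchase_Date, ties.method = "dense"), by = client]

dt$indx        # 1 2 3
dt$purch_rank  # 1 1 2
```

Which behaviour is preferable depends on whether same-day purchases should count as separate positions.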
Answer 1 (score: 4)
A dplyr/tidyr approach:
library(dplyr)
library(tidyr)
df %>%
group_by(client) %>%
mutate(purch_rank = dense_rank(purchase_Date)) %>%
select(-purchase_Date) %>%
spread(purch_rank, product)
#Source: local data frame [3 x 3]
#
# client 1 2
#1 client1 B A
#2 client2 A NA
#3 client3 D A
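With a more recent tidyr (1.0.0 or later), where `pivot_wider()` is available, the same idea can be sketched as below. This is a variation, not the answer's original code: it assumes `row_number()` after sorting instead of `dense_rank()`, and uses `names_prefix` to obtain the `purchase1`, `purchase2` column names asked for in the question.

```r
library(dplyr)
library(tidyr)

df <- data.frame(client = c("client1", "client1", "client2", "client3", "client3"),
                 product = c("A", "B", "A", "D", "A"),
                 purchase_Date = as.Date(c("2010-03-22", "2010-02-02", "2009-03-02",
                                           "2011-04-05", "2012-11-01")),
                 stringsAsFactors = FALSE)

wide <- df %>%
  arrange(client, purchase_Date) %>%       # oldest purchase first within each client
  group_by(client) %>%
  mutate(purch_rank = row_number()) %>%    # 1, 2, ... per client
  ungroup() %>%
  select(-purchase_Date) %>%
  pivot_wider(names_from = purch_rank,
              values_from = product,
              names_prefix = "purchase")   # columns purchase1, purchase2, ...
```

Clients with fewer purchases than the maximum get `NA` in the remaining columns, as in the desired output.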
A possible data.table approach:
library(data.table) #v 1.9.5+ currently from GitHub for "frank"
setDT(df)[, purch_rank := frank(purchase_Date, ties.method = "dense"), by=client]
dcast(df, client ~ purch_rank, value.var = "product")
# client 1 2
#1: client1 B A
#2: client2 A NA
#3: client3 D A
Answer 2 (score: 0)
Here is a solution with dplyr and tidyr:
df %>%
arrange(purchase_Date) %>% # sort first so the sequence follows purchase order
group_by(client) %>%
select(-purchase_Date) %>%
mutate(purchase = seq_along(product)) %>%
spread(purchase, product)
Source: local data frame [3 x 3]
client 1 2
1 client1 B A
2 client2 A NA
3 client3 D A
A slightly different approach, with a different output, uses the reshape2 package. Use the preceding code, but replace the last line with this one:
dcast(client ~ product)
Using purchase as value column: use value.var to override.
client A B D
1 client1 2 1 NA
2 client2 1 NA NA
3 client3 2 NA 1