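I have a data frame with three columns that captures transaction data: CustomerName, OrderDate, and the name of the product purchased. I need to transform it into another data frame in which all the items a customer bought on a single date appear in one row.
Since I am working with a large dataset, is there an efficient way to do this transformation, preferably without a for loop?
In addition, the number of product columns in the result must equal the maximum number of products bought by any customer on any single day. Below are examples of the data frame before and after the transformation.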
Original data:
data <- data.frame(
  Customer = c("John", "John", "John", "Tom", "Tom", "Tom", "Sally", "Sally", "Sally", "Sally"),
  OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
  Product = c("Milk", "Eggs", "Bread", "Chicken", "Pizza", "Beer", "Salad", "Apples", "Eggs", "Wine"),
  stringsAsFactors = FALSE)
# Customer OrderDate Product
# 1 John 1-Oct Milk
# 2 John 2-Oct Eggs
# 3 John 2-Oct Bread
# 4 Tom 2-Oct Chicken
# 5 Tom 2-Oct Pizza
# 6 Tom 2-Oct Beer
# 7 Sally 3-Oct Salad
# 8 Sally 3-Oct Apples
# 9 Sally 3-Oct Eggs
# 10 Sally 3-Oct Wine
After transformation:
datatransform <- as.data.frame(matrix(NA, nrow = 4, ncol = 6))
colnames(datatransform) <- c("Customer", "OrderDate", "Product1", "Product2", "Product3", "Product4")
datatransform$Customer <- c("John", "John", "Tom", "Sally")
datatransform$OrderDate <- c("1-Oct", "2-Oct", "2-Oct", "3-Oct")
datatransform[1, 3:6] <- c("Milk", "", "", "")
datatransform[2, 3:6 ] <- c("Eggs", "Bread", "", "")
datatransform[3, 3:6 ] <- c("Chicken", "Pizza", "Beer", "")
datatransform[4, 3:6 ] <- c("Salad", "Apples", "Eggs", "Wine")
# Customer OrderDate Product1 Product2 Product3 Product4
# 1 John 1-Oct Milk
# 2 John 2-Oct Eggs Bread
# 3 Tom 2-Oct Chicken Pizza Beer
# 4 Sally 3-Oct Salad Apples Eggs Wine
Answer 0 (score: 0)
Since you mention large datasets (so efficiency is an important consideration), here is a dplyr and reshape2 solution:
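library(reshape2)
library(dplyr)
# Number each product within its customer/date group, then spread those labels into columns
data %>%
  group_by(Customer, OrderDate) %>%
  mutate(ProductValue = paste0("Product", seq_len(n()))) %>%
  ungroup() %>%
  dcast(Customer + OrderDate ~ ProductValue, value.var = "Product") %>%
  arrange(OrderDate)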
Customer OrderDate Product1 Product2 Product3 Product4
1 John 1-Oct Milk <NA> <NA> <NA>
2 John 2-Oct Eggs Bread <NA> <NA>
3 Tom 2-Oct Chicken Pizza Beer <NA>
4 Sally 3-Oct Salad Apples Eggs Wine
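Note that the result above fills the unused slots with NA, whereas the desired output in the question uses empty strings. As a minimal sketch of one way to get empty strings, assuming the tidyr package is available (it is not part of the original answer), the same per-group numbering can be fed to pivot_wider with a fill value. The names wide and ProductValue are just illustrative.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question
data <- data.frame(
  Customer = c("John", "John", "John", "Tom", "Tom", "Tom", "Sally", "Sally", "Sally", "Sally"),
  OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
  Product = c("Milk", "Eggs", "Bread", "Chicken", "Pizza", "Beer", "Salad", "Apples", "Eggs", "Wine"),
  stringsAsFactors = FALSE)

wide <- data %>%
  group_by(Customer, OrderDate) %>%
  # Label each purchase Product1, Product2, ... within its customer/date group
  mutate(ProductValue = paste0("Product", row_number())) %>%
  ungroup() %>%
  # One column per label; missing combinations are filled with "" instead of NA
  pivot_wider(names_from = ProductValue,
              values_from = Product,
              values_fill = list(Product = ""))

print(wide)
```

The number of product columns follows from the largest group, so it automatically equals the maximum number of products bought by any customer on a single day.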