Folks,
I'm having a hard time with the following challenge. I have a dataset that looks like this:
BuyerID  Fruit.1     Fruit.2  Fruit.3  Amount.1  Amount.2  Amount.3
879      Banana      Apple             4         3
765      Strawberry  Apple    Orange   1         2         4
123      Orange      Banana            1         1         1
11       Strawberry                    3
773      Kiwi        Banana            1         2
What I would like to do is simplify the data (if possible) and collapse the "Fruit" and "Amount" variables:
BuyerID  Fruit                          Amount  Total  Count
879      "Banana" "Apple"               4 3     7      2
765      "Strawberry" "Apple" "Orange"  1 2 4   7      3
123      "Orange" "Banana"              1 1 1   3      2
11       "Strawberry"                   3       3      1
773      "Kiwi" "Banana"                1 2     3      2
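I have tried using c() and rbind(), but they do not produce the result I want. I have also tried the tips here: data.frame rows to a list, but I am not sure whether that is the best way to simplify my data.
Collapsing the data would presumably make it easier for me to work with fewer variables when counting the occurrence of certain items (e.g. 60% of buyers purchase bananas).
I hope this is doable, and I am open to any suggestions. Any solutions are appreciated!
Thank you.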
Answer 0 (score: 11)
An attempt at replicating the data, and using data.table:
DT <- data.frame(
  BuyerID = c(879,765,123,11,773),
  Fruit.1 = c('Banana','Strawberry','Orange','Strawberry','Kiwi'),
  Fruit.2 = c('Apple','Apple','Banana',NA,'Banana'),
  Fruit.3 = c( NA, 'Orange',NA,NA,NA),
  Amount.1 = c(4,1,1,3,1), Amount.2 = c(3,2,1,NA,2), Amount.3 = c(NA,4,1,NA,NA),
  Total = c(7,7,3,3,3),
  Count = c(2,3,2,1,2),
  stringsAsFactors = FALSE)
# reshaping to long form and data.table
library(data.table)
DTlong <- data.table(reshape(DT, varying = list(Fruit = 2:4, Amount = 5:7),
                             direction = 'long'))
# create lists (without NA values)
# also adding count and total columns
# by using <- to save Fruit and Amount for later use
DTlist <- DTlong[, list(Fruit <- list(as.vector(na.omit(Fruit.1))),
                        Amount <- list(as.vector(na.omit(Amount.1))),
                        Count = length(unlist(Fruit)),
                        Total = sum(unlist(Amount))),
                 by = BuyerID]
   BuyerID                      V1    V2 Count Total
1:     879            Banana,Apple   4,3     2     7
2:     765 Strawberry,Apple,Orange 1,2,4     3     7
3:     123           Orange,Banana 1,1,1     2     3
4:      11              Strawberry     3     1     3
5:     773             Kiwi,Banana   1,2     2     3
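Edit by @RicardoSaporta:
If you would like, you can skip the reshaping step by using list(list(c(....))). This can probably save a fair amount of execution time (the drawback being that it adds NA rather than blank spaces). However, as @Marius points out, the DTlong above is probably easier to work with.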
DT <- data.table(DT)
DT[, Fruit := list(list(c( Fruit.1, Fruit.2, Fruit.3))), by=BuyerID]
DT[, Amount := list(list(c(Amount.1, Amount.2, Amount.3))), by=BuyerID]
# Or as a single line
DT[, list( Fruit = list(c( Fruit.1, Fruit.2, Fruit.3)),
           Amount = list(c(Amount.1, Amount.2, Amount.3)),
           Total, Count), # other columns used
   by = BuyerID]
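The remark that the long form is easier to work with can be illustrated with the kind of question raised in the post (what share of buyers bought bananas). The following is only a minimal base-R sketch on the sample data; the names `wide`, `long`, `bought_banana` and `share_banana` are illustrative and do not come from the answers.

```r
# Sample data from the question, rebuilt so the sketch is self-contained
wide <- data.frame(
  BuyerID  = c(879, 765, 123, 11, 773),
  Fruit.1  = c("Banana", "Strawberry", "Orange", "Strawberry", "Kiwi"),
  Fruit.2  = c("Apple", "Apple", "Banana", NA, "Banana"),
  Fruit.3  = c(NA, "Orange", NA, NA, NA),
  Amount.1 = c(4, 1, 1, 3, 1),
  Amount.2 = c(3, 2, 1, NA, 2),
  Amount.3 = c(NA, 4, 1, NA, NA),
  stringsAsFactors = FALSE
)

# Long form: one row per buyer and fruit slot
long <- reshape(wide, direction = "long", idvar = "BuyerID",
                varying = list(2:4, 5:7), v.names = c("Fruit", "Amount"))

# Drop the empty fruit slots
long <- long[!is.na(long$Fruit), ]

# For each buyer: did they buy at least one banana?
bought_banana <- tapply(long$Fruit == "Banana", long$BuyerID, any)

# Share of buyers who bought bananas (0.6 for this sample)
share_banana <- mean(bought_banana)
share_banana
```

With the data in long form, the question reduces to one grouped logical test, with no need to scan Fruit.1, Fruit.2 and Fruit.3 separately.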
Answer 1 (score: 6)
Here is a solution with the base package. It is like Tyler's solution, but with a single apply.
res <- apply(DT,1,function(x){
  data.frame(Fruit= paste(na.omit(x[2:4]),collapse=' '),
             Amount = paste(na.omit(x[5:7]),collapse =','),
             Total = sum(as.numeric(na.omit(x[5:7]))),
             Count = length(na.omit(x[2:4])))
})
do.call(rbind,res)
                    Fruit  Amount Total Count
1            Banana Apple    4, 3     7     2
2 Strawberry Apple Orange 1, 2, 4     7     3
3           Orange Banana 1, 1, 1     3     2
4              Strawberry       3     3     1
5             Kiwi Banana    1, 2     3     2
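I would also replace the hard-coded index numbers by matching on the column names, something like this:
Fruit  <- grepl('^Fruit[.][0-9]+$', colnames(DT))
Amount <- grepl('^Amount[.][0-9]+$', colnames(DT))
# then replace x[2:4] by x[which(Fruit)] and x[5:7] by x[which(Amount)]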
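As a rough sketch of how name-based column selection could be carried through the whole summary, the version below picks the columns by pattern and builds the result in one data.frame call. It is a variation on the approach above, not the answer's original code, and the names `dat`, `fruit_cols`, `amount_cols` and `res` are illustrative.

```r
# Sample data from the question, rebuilt so the sketch is self-contained
dat <- data.frame(
  BuyerID  = c(879, 765, 123, 11, 773),
  Fruit.1  = c("Banana", "Strawberry", "Orange", "Strawberry", "Kiwi"),
  Fruit.2  = c("Apple", "Apple", "Banana", NA, "Banana"),
  Fruit.3  = c(NA, "Orange", NA, NA, NA),
  Amount.1 = c(4, 1, 1, 3, 1),
  Amount.2 = c(3, 2, 1, NA, 2),
  Amount.3 = c(NA, 4, 1, NA, NA),
  stringsAsFactors = FALSE
)

# Select columns by name pattern instead of by position
fruit_cols  <- grep("^Fruit[.][0-9]+$",  names(dat), value = TRUE)
amount_cols <- grep("^Amount[.][0-9]+$", names(dat), value = TRUE)

# Helper that pastes the non-missing values of one row together
collapse_row <- function(x, sep) paste(na.omit(x), collapse = sep)

res <- data.frame(
  BuyerID = dat$BuyerID,
  Fruit   = apply(dat[fruit_cols],  1, collapse_row, sep = " "),
  Amount  = apply(dat[amount_cols], 1, collapse_row, sep = ","),
  Total   = rowSums(dat[amount_cols], na.rm = TRUE),
  Count   = rowSums(!is.na(dat[fruit_cols])),
  stringsAsFactors = FALSE
)
res
```

Because the fruit and amount columns are passed to apply separately, the amounts stay numeric until they are pasted, so the collapsed strings carry no padding spaces.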
Edit: adding some benchmarking.
library(microbenchmark)
library(data.table)
microbenchmark(ag(),mn(), am(), tr())
Unit: milliseconds
expr min lq median uq max
1 ag() 11.584522 12.268140 12.671484 13.317934 109.13419
2 am() 9.776206 10.515576 10.798504 11.437938 137.44867
3 mn() 6.470190 6.805646 6.974797 7.290722 48.68571
4 tr() 1.759771 1.929870 2.026960 2.142066 7.06032
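For a small data frame, Tyler Rinker's solution is the winner! How I would explain this (just a guess)
Answer 2 (score: 5)
This is a really bad idea, but here it is with a base data.frame. It works because a data.frame is actually a list of equal-length vectors. You can force a data.frame to store vectors in its cells, but it takes some hackery. I would suggest other formats, including Marius's suggestion or a list.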
DT <- data.frame(
BuyerID = c(879,765,123,11,773),
Fruit.1 = c('Banana','Strawberry','Orange','Strawberry','Kiwi'),
Fruit.2 = c('Apple','Apple','Banana',NA,'Banana'),
Fruit.3 = c( NA, 'Orange',NA,NA,NA),
Amount.1 = c(4,1,1,3,1), Amount.2 = c(3,2,1,NA,2), Amount.3 = c(NA,4,1,NA,NA),
stringsAsFactors = FALSE)
DT2 <- DT[, 1, drop=FALSE]
DT2$Fruit <- apply(DT[, 2:4], 1, function(x) unlist(na.omit(x)))
DT2$Amount <- apply(DT[, 5:7], 1, function(x) unlist(na.omit(x)))
DT2$Total <- sapply(DT2$Amount, sum)
DT2$Count <- sapply(DT2$Fruit, length)
Yields:
> DT2
  BuyerID                     Fruit  Amount Total Count
1     879             Banana, Apple    4, 3     7     2
2     765 Strawberry, Apple, Orange 1, 2, 4     7     3
3     123            Orange, Banana 1, 1, 1     3     2
4      11                Strawberry       3     3     1
5     773              Kiwi, Banana    1, 2     3     2
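The statement that a data.frame is a list of equal-length vectors, and that a cell can be made to hold a vector, can be seen on a tiny example. This is only an illustrative sketch with made-up object names (`df`, `df2`); it shows two ways a list column can be created in base R, by assigning a list with one element per row, or by wrapping the list in I() inside data.frame().

```r
# A two-row data.frame with an ordinary column
df <- data.frame(BuyerID = c(879, 11))

# Assign a list with one element per row: each cell now holds a vector
df$Fruit <- list(c("Banana", "Apple"), "Strawberry")

# The same idea at construction time, protecting the list with I()
df2 <- data.frame(
  BuyerID = c(879, 11),
  Fruit   = I(list(c("Banana", "Apple"), "Strawberry"))
)

# The column is a list, and each cell can have a different length
is.list(df$Fruit)
lengths(df$Fruit)

# Per-row counts fall out of the cell lengths
df$Count <- lengths(df$Fruit)
df
```

The number of list elements must equal the number of rows; only the length of each individual cell is free to vary.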
Answer 3 (score: 4)
Adding to the good answers already here, here is another one (sticking with base R):
with(DT, {
  # Convert to long format
  DTlong <- reshape(DT, direction = "long",
                    idvar = "BuyerID", varying = 2:ncol(DT))
  # aggregate your fruit columns
  # You need the `do.call(data.frame, ...)` to convert
  # the resulting matrix-as-a-column into separate columns
  Agg1 <- do.call(data.frame,
                  aggregate(Fruit ~ BuyerID, DTlong,
                            function(x) c(Fruit = paste0(x, collapse = " "),
                                          Count = length(x))))
  # aggregate the amount columns
  Agg2 <- aggregate(Amount ~ BuyerID, DTlong, sum)
  # merge the results
  merge(Agg1, Agg2)
})
# BuyerID Fruit.Fruit Fruit.Count Amount
# 1 11 Strawberry 1 3
# 2 123 Orange Banana 2 3
# 3 765 Strawberry Apple Orange 3 7
# 4 773 Kiwi Banana 2 3
# 5 879 Banana Apple 2 7
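The basic concept is to:
1. Use reshape to get your data into long form (which is where I actually think you should stop).
2. Use two different aggregate commands, one to aggregate the fruit columns and one to aggregate the amount columns. The formula method of aggregate takes care of removing NA, but you can specify the desired behavior with the na.action argument.
3. Use merge to combine the two.
Answer 4 (score: 0)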
It did not exist when the question was asked, but tidyr works well for this.
Reusing the data from @mnel's answer:
library(tidyr)
separator <- ' '
DT %>%
unite(Fruit, grep("Fruit", names(.)), sep = separator) %>%
unite(Amount, grep("Amount", names(.)), sep = separator)
# BuyerID Fruit Amount Total Count
# 1 879 Banana Apple NA 4 3 NA 7 2
# 2 765 Strawberry Apple Orange 1 2 4 7 3
# 3 123 Orange Banana NA 1 1 1 3 2
# 4 11 Strawberry NA NA 3 NA NA 3 1
# 5 773 Kiwi Banana NA 1 2 NA 3 2
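The NA strings in the output above come from unite pasting missing values as text. Newer tidyr releases give unite an na.rm argument that drops missing values before pasting. The sketch below assumes such a release is installed (tidyr 1.0.0 or later, as far as I know) and uses illustrative object names (`dat`, `out`), so treat it as a variation and not part of the original answer.

```r
library(tidyr)

# Sample data from the question, rebuilt so the sketch is self-contained
dat <- data.frame(
  BuyerID  = c(879, 765, 123, 11, 773),
  Fruit.1  = c("Banana", "Strawberry", "Orange", "Strawberry", "Kiwi"),
  Fruit.2  = c("Apple", "Apple", "Banana", NA, "Banana"),
  Fruit.3  = c(NA, "Orange", NA, NA, NA),
  Amount.1 = c(4, 1, 1, 3, 1),
  Amount.2 = c(3, 2, 1, NA, 2),
  Amount.3 = c(NA, 4, 1, NA, NA),
  stringsAsFactors = FALSE
)

# Collapse the fruit columns, dropping missing values (assumes unite() has na.rm)
out <- unite(dat, "Fruit", starts_with("Fruit"), sep = " ", na.rm = TRUE)

# Collapse the amount columns the same way
out <- unite(out, "Amount", starts_with("Amount"), sep = " ", na.rm = TRUE)
out
```

Selecting with starts_with("Amount") after the first step still matches only Amount.1 to Amount.3, since the newly created column is named Fruit.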