My dataset looks like this:
I want to aggregate the data by day, with a result like this:
I tried the "aggregate" function:
## Subset data
seller <- c ("S1", "S2","S3","S4", "S5")
buyer <- c("B1", "B2", "B3", "B4", "B5")
ss <- df[seller, buyer]
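However, because some seller-buyer-food combinations do not exist (for example, seller S1 and buyer B3 have no transactions), R gives me an error:
Error in [.default(…)
Can someone tell me how to make "aggregate" keep working even when there are no transactions between a seller and a buyer? I appreciate all the help!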
Answer 0 (score: 1)
Here is a solution using tidyr and its spread() function (the grouping and counting steps come from dplyr):
Date <- as.character(Sys.Date()+0:4)
seller <- c ("S1", "S2","S3","S4", "S5")
buyer <- c("B1", "B2", "B3", "B4", "B5")
Food <- c("Coconut","Banana","Peach","Peach","Apple")
df <- data.frame(cbind(Date,seller,buyer,Food),stringsAsFactors=FALSE)
library(dplyr)  # provides %>%, group_by(), mutate(), n()
library(tidyr)  # provides spread()
df2 <- df%>%
group_by(Date,seller,buyer)%>%
mutate(count=n())%>%
spread(Food,count)
df2[is.na(df2)] <- 0
df2
Source: local data frame [5 x 7]
Groups: Date, seller, buyer [5]
Date seller buyer Apple Banana Coconut Peach
* <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2017-04-18 S1 B1 0 0 1 0
2 2017-04-19 S2 B2 0 1 0 0
3 2017-04-20 S3 B3 0 0 0 1
4 2017-04-21 S4 B4 0 0 0 1
5 2017-04-22 S5 B5 1 0 0 0
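Edit: to account for duplicates, add a summarise() step. The dataset has been modified so that the combination S1, B1, Banana occurs twice on the same date.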
Date <- as.character(Sys.Date()+c(0,0,1,2,3))
seller <- c ("S1", "S1","S3","S4", "S5")
buyer <- c("B1", "B1", "B3", "B4", "B5")
Food <- c("Banana","Banana","Peach","Peach","Apple")
df <- data.frame(cbind(Date,seller,buyer,Food),stringsAsFactors=FALSE)
library(dplyr)  # provides %>%, group_by(), summarise(), n()
library(tidyr)  # provides spread()
df2 <- df%>%
group_by(Date,seller,buyer,Food)%>%
summarise(count=n())%>%
spread(Food,count)
df2[is.na(df2)] <- 0
df2
Date seller buyer Apple Banana Peach
* <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2017-04-19 S1 B1 0 2 0
2 2017-04-20 S3 B3 0 0 1
3 2017-04-21 S4 B4 0 0 1
4 2017-04-22 S5 B5 1 0 0
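In current tidyr releases spread() is superseded by pivot_wider(). The following is a minimal sketch of the same counting step with that function, assuming dplyr and a tidyr version that provides pivot_wider() are installed. The fixed dates and the column name count are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Illustrative data: the S1/B1/Banana combination occurs twice on one date
df <- data.frame(
  Date   = as.character(as.Date("2017-04-19") + c(0, 0, 1, 2, 3)),
  seller = c("S1", "S1", "S3", "S4", "S5"),
  buyer  = c("B1", "B1", "B3", "B4", "B5"),
  Food   = c("Banana", "Banana", "Peach", "Peach", "Apple"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # one row per Date/seller/buyer/Food with the number of occurrences
  count(Date, seller, buyer, Food, name = "count") %>%
  # one column per Food; combinations that never occur are filled with 0
  pivot_wider(
    names_from  = Food,
    values_from = count,
    values_fill = list(count = 0L)
  )
wide
```

Because the fill value is supplied inside pivot_wider(), the separate df2[is.na(df2)] <- 0 step is not needed in this variant.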
Answer 1 (score: 1)
The dcast() function from the reshape2 or data.table package converts your data from long to wide format, conveniently in a single line:
data.table::dcast(ss, ... ~ Food, value.var = "Food", fill = 0L, fun.aggregate = length)
# Date seller buyer Apple Banana Coconut Peach
#1 2017-01-01 S1 B1 0 0 1 0
#2 2017-01-01 S2 B1 0 1 0 0
#3 2017-01-02 S2 B3 0 0 0 1
#4 2017-01-03 S3 B1 0 0 0 1
#5 2017-01-03 S3 B2 1 0 0 0
#6 2017-01-03 S4 B3 0 0 1 0
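This also works with duplicate entries, such as those in the edited sample data of the dplyr/tidyr solution.
Even for the data.frame ss, which has only 6 rows, dcast() is more than twice as fast as the dplyr/tidyr solution: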
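Depending on the installed data.table version, passing a plain data.frame to data.table::dcast() may produce a warning or be redirected to reshape2. A minimal sketch that avoids relying on that behaviour converts the input with as.data.table() first. The explicit formula and the object names dt and wide are illustrative assumptions.

```r
library(data.table)

# Same sample data as used in this answer
ss <- data.frame(
  Date   = as.Date("2017-01-01") + c(0L, 0L, 1L, 2L, 2L, 2L),
  seller = c("S1", "S2", "S2", "S3", "S3", "S4"),
  buyer  = c("B1", "B1", "B3", "B1", "B2", "B3"),
  Food   = c("Coconut", "Banana", "Peach", "Peach", "Apple", "Coconut"),
  stringsAsFactors = FALSE
)

# Convert to a data.table so the data.table method of dcast() is used
dt <- as.data.table(ss)

# Count the rows per Date/seller/buyer for each Food; absent combinations get 0
wide <- dcast(
  dt,
  Date + seller + buyer ~ Food,
  value.var     = "Food",
  fun.aggregate = length,
  fill          = 0L
)
wide
```

The result is a data.table, so it prints with data.table-style row labels rather than the data.frame layout shown above.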
Unit: milliseconds
expr min lq mean median uq max neval
tidyr2 4.765453 4.911954 5.140440 5.011259 5.163234 6.853099 100
dcast 1.934349 2.004580 2.102577 2.061972 2.122196 3.507352 100
Data and benchmark code:
Date <- as.Date("2017-01-01") + c(0L, 0L, 1L, 2L, 2L, 2L)
seller <- c ("S1", "S2", "S2", "S3","S3", "S4")
buyer <- c("B1", "B1", "B3", "B1", "B2", "B3")
Food <- c("Coconut", "Banana", "Peach", "Peach", "Apple", "Coconut")
ss <- data.frame(Date, seller, buyer, Food, stringsAsFactors = FALSE)
library(magrittr)
microbenchmark::microbenchmark(
tidyr2 = {
df2 <- ss%>%
dplyr::group_by(Date,seller,buyer)%>%
dplyr::mutate(count=dplyr::n())%>%
dplyr::group_by(Date,seller,buyer,Food) %>%
dplyr::summarise(count=sum(count)) %>%
tidyr::spread(Food,count)
df2[is.na(df2)] <- 0
df2
},
dcast = {
data.table::dcast(ss, ... ~ Food, value.var = "Food", fill = 0L, fun.aggregate = length)
}
)