How to aggregate time series data when some combinations are missing

Time: 2017-04-18 22:31:12

Tags: r dataframe aggregate missing-data

My dataset looks like this:

[image: sample of the dataset]

I want to aggregate the data by day, with a result like this:

[image: desired aggregated result]

I wrote the "aggregate" function:

## Subset data
seller <- c ("S1", "S2","S3","S4", "S5")
buyer <- c("B1", "B2", "B3", "B4", "B5")
ss <- df[seller, buyer]

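However, because some seller-buyer-food combinations do not exist (for example, seller S1 and buyer B3 have no transaction), R gives me an error that starts with "Error in [.default".

Could someone tell me how to make the "aggregate" function keep working even when there are no transactions between a seller and a buyer? I appreciate all the help!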

2 Answers:

Answer 0 (score: 1)

Here is one way, using spread from tidyr:

Date <- as.character(Sys.Date()+0:4)
seller <- c ("S1", "S2","S3","S4", "S5")
buyer <- c("B1", "B2", "B3", "B4", "B5")
Food <- c("Coconut","Banana","Peach","Peach","Apple")

df <- data.frame(cbind(Date,seller,buyer,Food),stringsAsFactors=FALSE)

library(dplyr)  # group_by(), mutate() and n() come from dplyr
library(tidyr)  # spread()
df2 <- df %>%
  group_by(Date, seller, buyer) %>%
  mutate(count = n()) %>%
  spread(Food, count)
df2[is.na(df2)] <- 0
df2

Source: local data frame [5 x 7]
Groups: Date, seller, buyer [5]

        Date seller buyer Apple Banana Coconut Peach
*      <chr>  <chr> <chr> <dbl>  <dbl>   <dbl> <dbl>
1 2017-04-18     S1    B1     0      0       1     0
2 2017-04-19     S2    B2     0      1       0     0
3 2017-04-20     S3    B3     0      0       0     1
4 2017-04-21     S4    B4     0      0       0     1
5 2017-04-22     S5    B5     1      0       0     0

Edit: To account for duplicates, add a summarise step. The dataset has been modified so that S1, B1 and Banana occur twice on the same date.

Date <- as.character(Sys.Date()+c(0,0,1,2,3))
seller <- c ("S1", "S1","S3","S4", "S5")
buyer <- c("B1", "B1", "B3", "B4", "B5")
Food <- c("Banana","Banana","Peach","Peach","Apple")

df <- data.frame(cbind(Date,seller,buyer,Food),stringsAsFactors=FALSE)

library(dplyr)  # group_by(), summarise() and n() come from dplyr
library(tidyr)  # spread()
df2 <- df %>%
  group_by(Date, seller, buyer, Food) %>%
  summarise(count = n()) %>%
  spread(Food, count)

df2[is.na(df2)] <- 0
df2

        Date seller buyer Apple Banana Peach
*      <chr>  <chr> <chr> <dbl>  <dbl> <dbl>
1 2017-04-19     S1    B1     0      2     0
2 2017-04-20     S3    B3     0      0     1
3 2017-04-21     S4    B4     0      0     1
4 2017-04-22     S5    B5     1      0     0
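
As a side note, spread() is marked as superseded in current tidyr releases in favour of pivot_wider(). The following is a minimal sketch of the same count-then-widen idea, not the answer's original code. It assumes dplyr >= 0.8 and tidyr >= 1.0, and it uses fixed dates instead of Sys.Date() so the result is reproducible; the object name wide is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same duplicate example as above, with fixed dates (an assumption for reproducibility)
df <- data.frame(
  Date   = as.Date("2017-04-18") + c(0, 0, 1, 2, 3),
  seller = c("S1", "S1", "S3", "S4", "S5"),
  buyer  = c("B1", "B1", "B3", "B4", "B5"),
  Food   = c("Banana", "Banana", "Peach", "Peach", "Apple"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # one row per Date/seller/buyer/Food with the number of transactions
  count(Date, seller, buyer, Food, name = "count") %>%
  # one column per Food; combinations that never occur are filled with 0
  pivot_wider(names_from = Food, values_from = count,
              values_fill = list(count = 0L))

print(wide)
```

Because values_fill supplies the zeros during reshaping, the separate df2[is.na(df2)] <- 0 step is not needed in this variant.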

Answer 1 (score: 1)

The dcast() function, available in the reshape2 and data.table packages, reshapes your data from long to wide format conveniently in a one-liner:

data.table::dcast(ss, ... ~ Food, value.var = "Food", fill = 0L, fun.aggregate = length)
#        Date seller buyer Apple Banana Coconut Peach
#1 2017-01-01     S1    B1     0      0       1     0
#2 2017-01-01     S2    B1     0      1       0     0
#3 2017-01-02     S2    B3     0      0       0     1
#4 2017-01-03     S3    B1     0      0       0     1
#5 2017-01-03     S3    B2     1      0       0     0
#6 2017-01-03     S4    B3     0      0       1     0

This also works for duplicate entries, as in the sample data from the edit of the dplyr / tidyr solution.
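
A minimal self-contained sketch of that duplicate case, assuming the data.table package is installed. The names ss_dup and wide and the fixed dates are made up for illustration, and the input is created as a data.table so that data.table's own dcast() method is the one that runs:

```r
library(data.table)

# Duplicate example: S1 / B1 / Banana occurs twice on the same date
ss_dup <- data.table(
  Date   = as.Date("2017-04-18") + c(0L, 0L, 1L, 2L, 3L),
  seller = c("S1", "S1", "S3", "S4", "S5"),
  buyer  = c("B1", "B1", "B3", "B4", "B5"),
  Food   = c("Banana", "Banana", "Peach", "Peach", "Apple")
)

# Count rows per Date/seller/buyer/Food; absent combinations become 0
wide <- dcast(ss_dup, Date + seller + buyer ~ Food,
              value.var = "Food", fun.aggregate = length, fill = 0L)

print(wide)
```

The duplicated Banana row ends up as a count of 2 in a single output row, so no separate summarise step is required.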

Benchmark results

Even for the small data.frame ss with only 6 rows, dcast() is more than twice as fast as the dplyr / tidyr solution:

Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval
 tidyr2 4.765453 4.911954 5.140440 5.011259 5.163234 6.853099   100
  dcast 1.934349 2.004580 2.102577 2.061972 2.122196 3.507352   100 

Benchmark code

Date <- as.Date("2017-01-01") + c(0L, 0L, 1L, 2L, 2L, 2L)
seller <- c ("S1", "S2", "S2", "S3","S3", "S4")
buyer <- c("B1", "B1", "B3", "B1", "B2", "B3")
Food <- c("Coconut", "Banana", "Peach", "Peach", "Apple", "Coconut")
ss <- data.frame(Date, seller, buyer, Food, stringsAsFactors = FALSE)

library(magrittr)
microbenchmark::microbenchmark(
  tidyr2 = {
    df2 <- ss%>%
      dplyr::group_by(Date,seller,buyer)%>%
      dplyr::mutate(count=dplyr::n())%>%  # n() is qualified because dplyr is not attached
      dplyr::group_by(Date,seller,buyer,Food)  %>%
      dplyr::summarise(count=sum(count))  %>%
      tidyr::spread(Food,count)
    df2[is.na(df2)] <- 0
    df2
  },
  dcast = {
    data.table::dcast(ss, ... ~ Food, value.var = "Food", fill = 0L, fun.aggregate = length)
  }
)