I'm having trouble aggregating data over a date series and a group when one of the dates is missing from some, but not all, of the groups.
dates <- seq.Date(as.Date("2010-01-01"), by=7, length.out=5)
dates.2 <- dates[-2]
all.dates <- c(dates, dates, dates.2, dates.2)
subgroups <- c(rep("a", 5), rep("b", 5), rep("c", 4), rep("d", 4))
groups <- c(rep("X", 10), rep("Y", 8))
set.seed(2)
df.1 <- data.frame(Date = all.dates,
Group = groups,
Subgrp = subgroups,
Cost = runif(18,100,200)
)
df.1
Date Group Subgrp Cost
1 2010-01-01 X a 118.4882
2 2010-01-08 X a 170.2374
3 2010-01-15 X a 157.3326
4 2010-01-22 X a 116.8052
5 2010-01-29 X a 194.3839
6 2010-01-01 X b 194.3475
7 2010-01-08 X b 112.9159
8 2010-01-15 X b 183.3449
9 2010-01-22 X b 146.8019
10 2010-01-29 X b 154.9984
11 2010-01-01 Y c 155.2674
12 2010-01-15 Y c 123.8895
13 2010-01-22 Y c 176.0513
14 2010-01-29 Y c 118.0820
15 2010-01-01 Y d 140.5282
16 2010-01-15 Y d 185.3548
17 2010-01-22 Y d 197.6398
18 2010-01-29 Y d 122.5825
> ag.1 <- aggregate(Cost ~ Group + Date, FUN=sum, data=df.1)
> ag.1
Group Date Cost
1 X 2010-01-01 312.8357
2 Y 2010-01-01 295.7956
3 X 2010-01-08 283.1533
4 X 2010-01-15 340.6775
5 Y 2010-01-15 309.2443
6 X 2010-01-22 263.6070
7 Y 2010-01-22 373.6912
8 X 2010-01-29 349.3823
9 Y 2010-01-29 240.6646
Group Y made no payment on 2010-01-08, but the ag.1 object is silent about group Y on that date. I would like ag.1 to have a row reflecting this:
> ag.1
Group Date Cost
1 X 2010-01-01 312.8357
2 Y 2010-01-01 295.7956
3 X 2010-01-08 283.1533
3a Y 2010-01-08 0.0000
4 X 2010-01-15 340.6775
5 Y 2010-01-15 309.2443
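I tried na.action=na.pass in the aggregate call, but (1) I don't really know what it does, and (2) it did not change the output.
Suggestions that don't use aggregate are welcome, but I would prefer to stay with the base packages.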
Answer 0 (score: 2)
expand.grid can be used to fill in the missing entries.
df.2 <- expand.grid(Date = unique(dates),Group = unique(groups))
df <- merge(df.1,df.2,all=TRUE)
aggregate(Cost ~ Group + Date, FUN=sum, data=df, na.action=na.pass)
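Edit: following the OP's hint, I found the appropriate tweak to the aggregate call (na.action=na.pass). The merged data now yields a row for group Y on 2010-01-08, with NA as its cost: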
Group Date Cost
1 X 2010-01-01 312.8357
2 Y 2010-01-01 295.7956
3 X 2010-01-08 283.1533
4 Y 2010-01-08 NA
5 X 2010-01-15 340.6775
6 Y 2010-01-15 309.2443
7 X 2010-01-22 263.6070
8 Y 2010-01-22 373.6912
9 X 2010-01-29 349.3823
10 Y 2010-01-29 240.6646
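The output above reports NA for the missing combination, whereas the question asked for 0. One possible adjustment, offered as a sketch rather than as part of the original answer, is to forward na.rm = TRUE to sum() while still keeping the NA row with na.action = na.pass. The names grid, full and ag.0 are placeholders introduced here, and the example data is rebuilt so the block runs on its own.

```r
# Rebuild the question's example data so this block is self-contained.
dates   <- seq.Date(as.Date("2010-01-01"), by = 7, length.out = 5)
dates.2 <- dates[-2]
set.seed(2)
df.1 <- data.frame(
  Date   = c(dates, dates, dates.2, dates.2),
  Group  = c(rep("X", 10), rep("Y", 8)),
  Subgrp = c(rep("a", 5), rep("b", 5), rep("c", 4), rep("d", 4)),
  Cost   = runif(18, 100, 200)
)

# Every Date x Group combination, including the one absent from df.1.
grid <- expand.grid(Date = unique(df.1$Date), Group = unique(df.1$Group),
                    stringsAsFactors = FALSE)

# all = TRUE keeps the unmatched grid row; its Subgrp and Cost become NA.
full <- merge(df.1, grid, all = TRUE)

# na.action = na.pass stops aggregate() from dropping the NA row, and
# na.rm = TRUE is passed through to sum(), so an all-NA cell sums to 0.
ag.0 <- aggregate(Cost ~ Group + Date, data = full, FUN = sum,
                  na.rm = TRUE, na.action = na.pass)
ag.0
```

This also suggests why na.action=na.pass alone changed nothing in the question: na.action only decides what happens to rows that already contain NA, and df.1 has no such rows until the merge adds one.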
Answer 1 (score: 1)
1) As long as every date occurs in at least one group, do this:
> as.data.frame(xtabs(Cost ~ Date + Group, df.1), responseName = "Cost")
Date Group Cost
1 2010-01-01 X 312.8357
2 2010-01-08 X 283.1533
3 2010-01-15 X 340.6775
4 2010-01-22 X 263.6070
5 2010-01-29 X 349.3823
6 2010-01-01 Y 295.7956
7 2010-01-08 Y 0.0000
8 2010-01-15 Y 309.2443
9 2010-01-22 Y 373.6912
10 2010-01-29 Y 240.6646
In fact, if this layout suits you, the xtabs part above may be all you need:
> xtabs(Cost ~ Date + Group, df.1)
Group
Date X Y
2010-01-01 312.8357 295.7956
2010-01-08 283.1533 0.0000
2010-01-15 340.6775 309.2443
2010-01-22 263.6070 373.6912
2010-01-29 349.3823 240.6646
2) If there are dates for which no group has an entry, convert the dates to a factor whose levels include the dates that do not occur:
> # define levels to be all weeks between minimum date and 2010-02-05
> levs <- as.character(seq(min(df.1$Date), as.Date("2010-02-05"), by = 7))
> df.2 <- transform(df.1, Date = factor(Date, sort(unique(levs))))
>
> # now repeat using df.2
> as.data.frame(xtabs(Cost ~ Date + Group, df.2), responseName = "Cost")
Date Group Cost
1 2010-01-01 X 312.8357
2 2010-01-08 X 283.1533
3 2010-01-15 X 340.6775
4 2010-01-22 X 263.6070
5 2010-01-29 X 349.3823
6 2010-02-05 X 0.0000
7 2010-01-01 Y 295.7956
8 2010-01-08 Y 0.0000
9 2010-01-15 Y 309.2443
10 2010-01-22 Y 373.6912
11 2010-01-29 Y 240.6646
12 2010-02-05 Y 0.0000
> xtabs(Cost ~ Date + Group, df.2)
Group
Date X Y
2010-01-01 312.8357 295.7956
2010-01-08 283.1533 0.0000
2010-01-15 340.6775 309.2443
2010-01-22 263.6070 373.6912
2010-01-29 349.3823 240.6646
2010-02-05 0.0000 0.0000
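One detail to watch with the long-format result: as.data.frame() on a table returns Date as a factor, not as a Date. The sketch below repeats approach 2 on rebuilt example data and converts the column back. The name out and the final ordering step are additions made here for illustration, not part of the original answer.

```r
# Rebuild the question's example data so this block is self-contained.
dates   <- seq.Date(as.Date("2010-01-01"), by = 7, length.out = 5)
dates.2 <- dates[-2]
set.seed(2)
df.1 <- data.frame(
  Date   = c(dates, dates, dates.2, dates.2),
  Group  = c(rep("X", 10), rep("Y", 8)),
  Subgrp = c(rep("a", 5), rep("b", 5), rep("c", 4), rep("d", 4)),
  Cost   = runif(18, 100, 200)
)

# Levels cover every week that should appear, even weeks with no data.
levs <- as.character(seq(min(df.1$Date), as.Date("2010-02-05"), by = 7))
df.2 <- transform(df.1, Date = factor(Date, levels = levs))

# xtabs() sums Cost per cell and fills empty cells with 0.
out <- as.data.frame(xtabs(Cost ~ Date + Group, df.2), responseName = "Cost")

# Date is a factor at this point; convert it back to class Date.
out$Date <- as.Date(as.character(out$Date))

# Optional: order by date, then group, to match the layout of ag.1.
out <- out[order(out$Date, out$Group), ]
out
```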