Counting factor levels over time

Time: 2016-02-23 10:45:29

Tags: r date grouping factors

I have the following data.frame:

head(entries,10)

     Provider.Region      year.start    month.start day.start  Provider.Status
23511      North West       0010          05        17 Deregistered (V)
23512      North West       0010          05        17 Deregistered (V)
23709   West Midlands       0010          06        01       Registered
23562          London       0010          06        10       Registered
23563          London       0010          06        10       Registered
23566          London       0010          06        10       Registered
23764   West Midlands       0010          06        10 Deregistered (V)
23508          London       0010          06        11 Deregistered (V)
23555   West Midlands       0010          06        11       Registered
23497      South East       0010          06        14 Deregistered (V)

I want to count, per month, how often each level of the factor Provider.Status occurs. My desired output looks like this:

head(entries.1, 3)

time    region        Deregistered (V) Registered 
5-0010  North West        2              0
6-0010  West Midlands     2              1
6-0010  London            1              3

So far I have been using dplyr, as follows:

library(dplyr)
entries %>%
  group_by(Provider.Region, year.start, month.start) %>%
  mutate(counts_status = n())  

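But this still does not produce the output I expect, because mutate() keeps every row and only attaches the group size to it, giving something like: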

Source: local data frame [23,775 x 6]
Groups: Provider.Region, year.start, month.start [606]

Provider.Region year.start month.start  Provider.Status counts_status
(fctr)     (fctr)      (fctr)              (fctr)         (int)
1       North West       0010          05 Deregistered (V)      2
2       North West       0010          05 Deregistered (V)      2
3    West Midlands       0010          06 Registered            4
4           London       0010          06 Registered            7
5           London       0010          06 Registered            7
6           London       0010          06 Registered            7
7    West Midlands       0010          06 Deregistered (V)      4
8           London       0010          06 Deregistered (V)      7
9    West Midlands       0010          06 Registered            4
10      South East       0010          06 Deregistered (V)      10
..             ...        ...         ...       ...              ...

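Is there a compact way to create variables from the counts? Many thanks in advance.

2 Answers:

Answer 0: (score: 2)

This can be done with the dcast function from either the reshape2 or the data.table package: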

library(reshape2)
dcast(mydf, paste(year.start,month.start,sep="-") + Provider.Region ~ Provider.Status)

library(data.table)
dcast(setDT(mydf), paste(year.start,month.start,sep="-") + Provider.Region ~ Provider.Status)

The output of the latter:

   year.start Provider.Region Deregistered(V) Registered
1:    0010-05       NorthWest               2          0
2:    0010-06          London               1          3
3:    0010-06       SouthEast               1          0
4:    0010-06    WestMidlands               1          2

When you run the code above, you will get the following message:

Using 'Provider.Status' as value column. Use 'value.var' to override
Aggregate function missing, defaulting to 'length'

This message is harmless, but to avoid it you can specify value.var and the aggregation function explicitly:

dcast(setDT(mydf), 
      paste(year.start,month.start,sep="-") + Provider.Region ~ Provider.Status,
      value.var = "Provider.Status", fun.aggregate = length)
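
For anyone who wants to try this without the full dataset, here is a minimal, self-contained sketch using reshape2. It assumes the ten sample rows from the question, rebuilt by hand as mydf, and it adds a helper column named time (my own naming, not part of the original data) so the month label gets a readable column name.

```r
library(reshape2)

# The ten sample rows shown in the question, rebuilt by hand (day.start omitted)
mydf <- data.frame(
  Provider.Region = c("North West", "North West", "West Midlands", "London",
                      "London", "London", "West Midlands", "London",
                      "West Midlands", "South East"),
  year.start      = "0010",
  month.start     = c("05", "05", rep("06", 8)),
  Provider.Status = c("Deregistered (V)", "Deregistered (V)", "Registered",
                      "Registered", "Registered", "Registered",
                      "Deregistered (V)", "Deregistered (V)", "Registered",
                      "Deregistered (V)")
)

# Helper column (hypothetical name) combining year and month
mydf$time <- paste(mydf$year.start, mydf$month.start, sep = "-")

# One row per time/region, one column per status level, cells hold row counts
counts <- dcast(mydf, time + Provider.Region ~ Provider.Status,
                value.var = "Provider.Status", fun.aggregate = length)
print(counts)
```

Because value.var and fun.aggregate are given explicitly, this version should run without the message shown above.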

Answer 1: (score: 1)

You can use the reshape2 package to produce a table like this:

library(reshape2)
d <- data.frame(region=rep(c("A", "B", "C"), each=2), timepoint = c(1, 1, 1, 1, 2, 2), provider=rep(c("D", "R"), 3), count_status = 1:6)
dcast(d, region + timepoint ~ provider, value.var = "count_status")

This gives the following output:

  region timepoint D R
1      A         1 1 2
2      B         1 3 4
3      C         2 5 6
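
The toy example above starts from a table that already contains a count column. As a rough sketch of how such a table might be produced from the question's data, one option is dplyr's count() followed by the same dcast call. The data frame below is the ten sample rows rebuilt by hand, and the names time, long_counts and wide_counts are my own, not from the original post.

```r
library(dplyr)
library(reshape2)

# The ten sample rows shown in the question, rebuilt by hand (day.start omitted)
entries <- data.frame(
  Provider.Region = c("North West", "North West", "West Midlands", "London",
                      "London", "London", "West Midlands", "London",
                      "West Midlands", "South East"),
  year.start      = "0010",
  month.start     = c("05", "05", rep("06", 8)),
  Provider.Status = c("Deregistered (V)", "Deregistered (V)", "Registered",
                      "Registered", "Registered", "Registered",
                      "Deregistered (V)", "Deregistered (V)", "Registered",
                      "Deregistered (V)")
)

# Step 1: one row per time/region/status combination that occurs, with its count
long_counts <- entries %>%
  mutate(time = paste(year.start, month.start, sep = "-")) %>%
  count(time, Provider.Region, Provider.Status, name = "count_status")

# Step 2: spread the status levels into columns; absent combinations become 0
wide_counts <- dcast(long_counts, time + Provider.Region ~ Provider.Status,
                     value.var = "count_status", fill = 0)
print(wide_counts)
```

The difference from the attempt in the question is that count() collapses each group to a single row, whereas mutate(n()) keeps all rows and repeats the group size on each of them.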