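R: Conditional aggregation based on factor level and year

Time: 2016-12-03 21:14:35

Tags: r sum conditional aggregate

I have a dataset in R that I am trying to aggregate by column level and year. It looks like this: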

    City  State   Year   Status      Year_repealed   PolicyNo
    Pitt   PA     2001   InForce                        6
    Phil.  PA     2001   Repealed        2004           9
    Pitt   PA     2002   InForce                        7
    Pitt   PA     2005   InForce                        2

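What I would like to create is a table where, for each year, PolicyNo is aggregated across each state, taking into account the year in which a policy was repealed. The result I would then get is: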

    Year    State PolicyNo
    2001     PA     15  
    2002     PA     22
    2003     PA     22
    2004     PA     12 
    2005     PA     14

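I am not sure how to split and aggregate the data in a way that accounts for the repeals, and I was wondering whether there is an easy way to do this.

2 answers:

Answer 0: (score: 2)

It may help to break this into two separate problems.

  1. Get a table showing the change in PolicyNo for each city-state-year.
  2. Summarize that table to show PolicyNo for each state-year.

To accomplish (1), we add the missing years with an NA PolicyNo, and add the repeals as observations with a negative PolicyNo.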

    library(dplyr)
    
    df = structure(list(City = c("Pitt", "Phil.", "Pitt", "Pitt"), State = c("PA", "PA", "PA", "PA"), Year = c(2001L, 2001L, 2002L, 2005L), Status = c("InForce", "Repealed", "InForce", "InForce"), Year_repealed = c(NA, 2004L, NA, NA), PolicyNo = c(6L, 9L, 7L, 2L)), .Names = c("City", "State", "Year", "Status", "Year_repealed", "PolicyNo"), class = "data.frame", row.names = c(NA, -4L))
    
    repeals = df %>%
      filter(!is.na(Year_repealed)) %>%
      mutate(Year = Year_repealed, PolicyNo = -1 * PolicyNo)
    repeals
    #    City State Year   Status Year_repealed PolicyNo
    # 1 Phil.    PA 2004 Repealed          2004       -9
    
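    # One row per City/State/Year so that years without activity still appear;
    # stringsAsFactors = FALSE keeps City and State as character, as in df
    all_years = expand.grid(City = unique(df$City), State = unique(df$State),
                            Year = 2001:2005, stringsAsFactors = FALSE)
    
    df = bind_rows(df, repeals, all_years)
    df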
    #     City State Year   Status Year_repealed PolicyNo
    # 1   Pitt    PA 2001  InForce            NA        6
    # 2  Phil.    PA 2001 Repealed          2004        9
    # 3   Pitt    PA 2002  InForce            NA        7
    # 4   Pitt    PA 2005  InForce            NA        2
    # 5  Phil.    PA 2004 Repealed          2004       -9
    # 6   Pitt    PA 2001     <NA>            NA       NA
    # 7  Phil.    PA 2001     <NA>            NA       NA
    # 8   Pitt    PA 2002     <NA>            NA       NA
    # 9  Phil.    PA 2002     <NA>            NA       NA
    # 10  Pitt    PA 2003     <NA>            NA       NA
    # 11 Phil.    PA 2003     <NA>            NA       NA
    # 12  Pitt    PA 2004     <NA>            NA       NA
    # 13 Phil.    PA 2004     <NA>            NA       NA
    # 14  Pitt    PA 2005     <NA>            NA       NA
    # 15 Phil.    PA 2005     <NA>            NA       NA
    

The table now has a row for every city-state-year and includes the repeals. This is a table we can summarize.

    df = df %>%
      group_by(Year, State) %>%
      summarize(annual_change = sum(PolicyNo, na.rm = TRUE))
    df
    # Source: local data frame [5 x 3]
    # Groups: Year [?]
    # 
    #    Year State annual_change
    #   <int> <chr>         <dbl>
    # 1  2001    PA            15
    # 2  2002    PA             7
    # 3  2003    PA             0
    # 4  2004    PA            -9
    # 5  2005    PA             2
    

This gives us the change in PolicyNo for each state-year. The cumulative sum of these changes gives us the levels.

    df = df %>%
      ungroup() %>%
      mutate(PolicyNo = cumsum(annual_change))
    df
    # # A tibble: 5 × 4
    #    Year State annual_change PolicyNo
    #   <int> <chr>         <dbl>    <dbl>
    # 1  2001    PA            15       15
    # 2  2002    PA             7       22
    # 3  2003    PA             0       22
    # 4  2004    PA            -9       13
    # 5  2005    PA             2       15
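
The cumulative sum above runs over all rows, which is fine for the single-state sample. As a minimal sketch, assuming a data set with more than one state, one could group by `State` before calling `cumsum()` so that each state accumulates separately. The `annual` data frame and its `NY` numbers below are made up purely for illustration and are not part of the question's data.

```r
library(dplyr)

# Hypothetical annual changes for two states; the NY values are invented
annual <- data.frame(
  Year          = rep(2001:2003, times = 2),
  State         = rep(c("PA", "NY"), each = 3),
  annual_change = c(15, 7, 0, 4, 0, -1),
  stringsAsFactors = FALSE
)

levels_by_state <- annual %>%
  arrange(State, Year) %>%                      # cumsum() depends on row order
  group_by(State) %>%                           # restart the running total per state
  mutate(PolicyNo = cumsum(annual_change)) %>%
  ungroup()

levels_by_state
```

With only one state in the data, this should give the same levels as the ungrouped version.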
    

Answer 1: (score: 0)

With the data.table package, you can do this as follows:

    library(data.table)
    
    # dat is the data frame from the question
    melt(setDT(dat), 
         measure.vars = c(3,5),
         value.name = 'Year',
         value.factor = FALSE)[!is.na(Year)
                               ][variable == 'Year_repealed', PolicyNo := -1*PolicyNo
                                 ][CJ(Year = min(Year):max(Year), State = State, unique = TRUE), on = .(Year, State)
                                   ][is.na(PolicyNo), PolicyNo := 0
                                     ][, .(PolicyNo = sum(PolicyNo)), by = .(Year, State)
                                       ][, .(Year, State, PolicyNo = cumsum(PolicyNo))]

The result of the code above:

       Year State PolicyNo
    1: 2001    PA       15
    2: 2002    PA       22
    3: 2003    PA       22
    4: 2004    PA       13
    5: 2005    PA       15

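As you can see, several steps are needed to get to the desired result:

  • First convert to a data.table (setDT(dat)), reshape it to long format, and remove the rows without a Year.
  • Then make PolicyNo negative for the rows where variable is 'Year_repealed'.
  • With a cross join (CJ) you make sure that every year is present for each state, and then convert the NA values in the PolicyNo column to zero.
  • Finally, sum PolicyNo by year and state, and take the cumulative sum of the result.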
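
The steps listed above can also be written one at a time, which makes the intermediate tables easier to inspect. This is only a sketch of the same idea, not the answerer's exact code: the names `long`, `grid`, `full`, and `changes` are introduced here for illustration, `dat` is rebuilt from the question's sample rows, and the cumulative sum is taken per state (an assumption that matters only when there is more than one state).

```r
library(data.table)

# The question's sample data, rebuilt so the sketch is self-contained
dat <- data.table(
  City          = c("Pitt", "Phil.", "Pitt", "Pitt"),
  State         = c("PA", "PA", "PA", "PA"),
  Year          = c(2001L, 2001L, 2002L, 2005L),
  Status        = c("InForce", "Repealed", "InForce", "InForce"),
  Year_repealed = c(NA, 2004L, NA, NA),
  PolicyNo      = c(6L, 9L, 7L, 2L)
)

# Step 1: long format, one row for the start year and one for the repeal year;
# rows without a year (policies that were never repealed) are dropped
long <- melt(dat, measure.vars = c("Year", "Year_repealed"),
             value.name = "Year", value.factor = FALSE)
long <- long[!is.na(Year)]

# Step 2: a repeal removes the policy count, so it enters as a negative change
long[variable == "Year_repealed", PolicyNo := -PolicyNo]

# Step 3: cross join of all years and states, then fill the gaps with zero
grid <- CJ(Year = min(long$Year):max(long$Year), State = long$State, unique = TRUE)
full <- long[grid, on = .(Year, State)]
full[is.na(PolicyNo), PolicyNo := 0L]

# Step 4: net change per year and state, then the running total within each state
changes <- full[, .(annual_change = sum(PolicyNo)), by = .(Year, State)]
changes[, PolicyNo := cumsum(annual_change), by = State]
result <- changes[, .(Year, State, PolicyNo)]
result
```

For the sample data, `result$PolicyNo` should match the output shown above (15, 22, 22, 13, 15).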