Conditions in a loop with group_by() / summarize()

Date: 2017-08-22 23:55:55

Tags: r dplyr summarization

I have a data frame that looks like this (I have many more years and variables):

Name    State2014     State2015  State2016  Tuition2014   Tuition2015  Tuition2016  StateGrants2014
Jared   CA            CA         MA         22430         23060        40650        5000
Beth    CA            CA         CA         36400         37050        37180        4200
Steven  MA            MA         MA         18010         18250        18720        NA
Lary    MA            CA         MA         24080         30800        24600        6600
Tom     MA            OR         OR         40450         15800        16040        NA
Alfred  OR            OR         OR         23570         23680        23750        3500
Cathy   OR            OR         OR         32070         32070        33040        4700

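My goal (in this example) is to get the average tuition per state and the total amount of state grants per state. My idea is to split the data by year: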

State2014     Tuition2014   StateGrants2014
CA            22430         5000
CA            36400         4200
MA            18010         NA
MA            24080         6600
MA            40450         NA
OR            23570         3500
OR            32070         4700

State2015  Tuition2015  
CA         23060        
CA         37050        
MA         18250        
CA         30800        
OR         15800        
OR         23680        
OR         32070       

State2016  Tuition2016  
MA         40650        
CA         37180        
MA         18720        
MA         24600        
OR         16040        
OR         23750        
OR         33040 

Then I would group_by State and summarize (saving each result as a separate df) to get the following:

State2014     Tuition2014   StateGrants2014
CA            29415         9200
MA            27513         6600
OR            27820         8200

State2015  Tuition2015  
CA         30303        
MA         18250        
OR         23850    

State2016  Tuition2016  
CA         37180        
MA         27990        
OR         24277        

Then I would merge them by state. Here is my code:

years = c(2014,2015,2016)
for (i in seq_along(years)) {
  #grab the variables from a certain year and save as a new df.
  df_year <- df[, grep(paste(years[[i]],"$",sep=""), colnames(df))]

  #Take off the year from each variable name (to make it easier to summarize)
  names(df_year) <- gsub(years[[i]], "", names(df_year), fixed = TRUE)

  df_year <- df_year %>%
    group_by(State) %>%
    summarize(Tuition = mean(Tuition, na.rm = TRUE),
            #this part of the code does not work. In this example, I only want to have this part if the year is 2016.
              if (years[[i]]=='2016')
                {StateGrants = mean(StateGrants, na.rm = TRUE)})

  #rename df_year to df####
  assign(paste("df",years[[i]],sep=''),df_year)
}

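I have about 50 years of data and a large number of variables, so I would like to use a loop. My question is: how can I add a conditional statement inside group_by() / summarize() so that certain variables are only summarized for certain years? Thanks!

*Edit: I realized I could take the if {} out of the function and do the following: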

  if (years[[i]] == 2016) {
    df_year <- df_year %>%
      group_by(State) %>%
      summarize(Tuition = mean(Tuition, na.rm = TRUE),
                StateGrants = mean(StateGrants, na.rm = TRUE))

    # rename df_year to df####
    assign(paste("df", years[[i]], sep = ""), df_year)
  } else {
    df_year <- df_year %>%
      group_by(State) %>%
      summarize(Tuition = mean(Tuition, na.rm = TRUE))

    # rename df_year to df####
    assign(paste("df", years[[i]], sep = ""), df_year)
  }
}

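But there are many variable combinations, so handling each one this way inside a for loop would not be very efficient or useful.

1 Answer:

Answer 0 (score: 4)

This is easy to do with tidy data, so let me show you how to tidy your data. See http://r4ds.had.co.nz/tidy-data.html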

library(tidyr)
library(dplyr)

df <- gather(df, key, value, -Name) %>% 
  # separate years from the variables
  separate(key, c("var", "year"), sep = -5) %>% 
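  # the above line splits up e.g. State2014 into State and 2014.
  # It does so by splitting at a position counted from the end of the
  # entry. How a negative `sep` is counted may differ between tidyr
  # versions, so you may need sep = -4 instead. Please check that this
  # works for your data and for your other variables in case your
  # naming conventions are inconsistent.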
  spread(var, value) %>% 
  # turn numbers back to numeric
  mutate_at(c("Tuition", "StateGrants"), as.numeric) %>% 
  gather(var, val, -Name, -year, -State) %>% 
  # group by the variables of interest. Note that `var` here 
  # refers to Tuition and StateGrants. If you have more variables,
  # they will be included here as well. If you want to exclude more
  # variables from being included here in `var`, add more "-colName" 
  # entries in the `gather` statement above
  group_by(year, State, var) %>% 
  # summarize:
  summarise(mean_values = mean(val))

This gives you:

Source: local data frame [18 x 4]
Groups: year, State [?]
    year State         var mean_values
   <chr> <chr>       <chr>       <dbl>
1   2014    CA StateGrants     4600.00
2   2014    CA     Tuition    29415.00
3   2014    MA StateGrants          NA
4   2014    MA     Tuition    27513.33
5   2014    OR StateGrants     4100.00
6   2014    OR     Tuition    27820.00
7   2015    CA StateGrants          NA
8   2015    CA     Tuition    30303.33
9   2015    MA StateGrants          NA
10  2015    MA     Tuition    18250.00
11  2015    OR StateGrants          NA
12  2015    OR     Tuition    23850.00
13  2016    CA StateGrants          NA
14  2016    CA     Tuition    37180.00
15  2016    MA StateGrants          NA
16  2016    MA     Tuition    27990.00
17  2016    OR StateGrants          NA
18  2016    OR     Tuition    24276.67
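Because the tidying step relies on every column name ending in a four-digit year, it can help to check the naming convention before running the pipeline. The following is only a sketch in base R; the vector `cols` is a made-up sample of column names, and the regular expressions assume names of the form letters followed by exactly four digits.

```r
# Hypothetical sample of column names (replace with names(df)[-1] on real data)
cols <- c("State2014", "Tuition2015", "StateGrants2014")

# Variable part: everything before the trailing four-digit year
var <- sub("[0-9]{4}$", "", cols)

# Year part: the trailing four digits
year <- sub("^.*([0-9]{4})$", "\\1", cols)

# Names that do not follow the assumed "letters + 4 digits" pattern
bad <- cols[!grepl("^[A-Za-z]+[0-9]{4}$", cols)]

print(data.frame(cols, var, year))
print(bad)
```

If `bad` is not empty, those columns would need renaming (or separate handling) before they can be split into a variable and a year.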

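If you don't like this shape, you can for example add %>% spread(var, mean_values) after the summarise statement to get Tuition and StateGrants in separate columns.

If you want to compute different functions for tuition and grants (e.g. the mean of tuition and the sum of grants), you can do the following: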
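As a small illustration of that reshaping step, here is a sketch on a toy data frame. The data frame `long` is made up for the example (its values are copied from the 2014 CA and OR rows of the output above); only `spread()` from tidyr is used.

```r
library(tidyr)

# Toy long-format summary with the same columns as the output above
long <- data.frame(
  year = c("2014", "2014", "2014", "2014"),
  State = c("CA", "CA", "OR", "OR"),
  var = c("StateGrants", "Tuition", "StateGrants", "Tuition"),
  mean_values = c(4600, 29415, 4100, 27820),
  stringsAsFactors = FALSE
)

# One column per entry of `var`, filled with the matching mean_values
wide <- spread(long, var, mean_values)
print(wide)
```

Each year/State combination becomes a single row, with StateGrants and Tuition side by side.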

df <- gather(df, key, value, -Name) %>% 
   separate(key, c("var", "year"), sep = -5) %>% 
   spread(var, value) %>% 
   mutate_at(c("Tuition", "StateGrants"), as.numeric) %>% 
   group_by(year, State) %>% 
   summarise(Grant_Sum = sum(StateGrants, na.rm=T), Tuition_Mean = mean(Tuition) )

This gives you:

Source: local data frame [9 x 4]
Groups: year [?]

   year State Grant_Sum Tuition_Mean
  <chr> <chr>     <dbl>        <dbl>
1  2014    CA      9200     29415.00
2  2014    MA      6600     27513.33
3  2014    OR      8200     27820.00
4  2015    CA         0     30303.33
5  2015    MA         0     18250.00
6  2015    OR         0     23850.00
7  2016    CA         0     37180.00
8  2016    MA         0     27990.00
9  2016    OR         0     24276.67

Note that I used na.rm = T in sum, which returns 0 if all elements are NA. Make sure this makes sense in your use case.
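To make that behaviour concrete, here is a minimal base R sketch. The helper `sum_or_na` is hypothetical (not part of dplyr or base R); it is one possible way to get NA instead of 0 when a group has no observed grants at all.

```r
# A group where every grant value is missing
grants <- c(NA_real_, NA_real_)

# With na.rm = TRUE the sum over zero remaining values is 0
all_na_sum <- sum(grants, na.rm = TRUE)

# Hypothetical helper: return NA when nothing was observed,
# otherwise the sum of the non-missing values
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

print(all_na_sum)
print(sum_or_na(grants))
print(sum_or_na(c(3500, 4700, NA)))
```

Inside summarise, such a helper could be used in place of sum(StateGrants, na.rm = T) if a 0 would be misleading for years without any grant data.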

Also, just as a side note: to get the individual data.frames you asked for, you can use filter(year == 2014) and so on, as in df_2014 <- filter(df, year == 2014).
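With around 50 years, writing one filter call per year gets tedious. As an alternative sketch (not part of the approach above), base R's `split()` returns a named list with one data frame per year. The data frame `summary_df` below is a made-up stand-in for the summarised result.

```r
# Toy stand-in for the summarised result (values taken from the output above)
summary_df <- data.frame(
  year = c("2014", "2014", "2015"),
  State = c("CA", "MA", "CA"),
  Tuition_Mean = c(29415, 27513.33, 30303.33),
  stringsAsFactors = FALSE
)

# One data frame per year, stored in a list named by year
by_year <- split(summary_df, summary_df$year)

print(names(by_year))
print(by_year[["2014"]])
```

Keeping the pieces in a list avoids creating dozens of separate objects with assign(), and each year is still available by name, e.g. by_year[["2015"]].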