Question

Imagine I have the following data:

    Year   Month  State  ppo 
    2011   Jan     CA    220 
    2011   Feb     CA    250
    2012   Jan     CA    230 
    2011   Jan     WA    200 
    2011   Feb     WA    210

I need to calculate the average for each state within each year, so the output would look something like this:

    Year   Month  State  ppo  annualAvg
    2011   Jan     CA    220    235
    2011   Feb     CA    250    235
    2012   Jan     CA    230    230
    2011   Jan     WA    200    205
    2011   Feb     WA    210    205

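where the annual average is the average of all entries for that state in the same year. If the year and state were constant I would know how to do this, but somehow the fact that they vary is throwing me off.

Looking around, it seems that ddply might be what I want to use for this (https://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r), but when I tried to use it I was doing something wrong and kept getting errors (I have tried so many variations that I won't bother posting them here). Any idea how I should actually be doing this?

Thanks for your help!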

Answer 1

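Try this:

library(data.table)

# convert df to a data.table by reference
setDT(df)

# add the mean of ppo within each (Year, State) group as a new column
df[, annualAvg := mean(ppo), by = .(Year, State)]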

Answer 2

Using dplyr, with group_by %>% mutate to add the column:

library(dplyr)
df %>% group_by(Year, State) %>% mutate(annualAvg = mean(ppo))

#Source: local data frame [5 x 5]
#Groups: Year, State [3]

#   Year  Month  State   ppo annualAvg
#  (int) (fctr) (fctr) (int)     (dbl)
#1  2011    Jan     CA   220       235
#2  2011    Feb     CA   250       235
#3  2012    Jan     CA   230       230
#4  2011    Jan     WA   200       205
#5  2011    Feb     WA   210       205

Using data.table:

library(data.table)
setDT(df)[, annualAvg := mean(ppo), .(Year, State)]

df
#   Year Month State ppo annualAvg
#1: 2011   Jan    CA 220       235
#2: 2011   Feb    CA 250       235
#3: 2012   Jan    CA 230       230
#4: 2011   Jan    WA 200       205
#5: 2011   Feb    WA 210       205

Data:

# input data as a plain data.frame (before annualAvg is added)
df <- structure(list(Year = c(2011L, 2011L, 2012L, 2011L, 2011L), Month = structure(c(2L, 1L, 2L, 2L, 1L), .Label = c("Feb", "Jan"), class = "factor"), State = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("CA", "WA"), class = "factor"), ppo = c(220L, 250L, 230L, 200L, 210L)), .Names = c("Year", "Month", "State", "ppo"), class = "data.frame", row.names = c(NA, -5L))

Answer 3

Base R: df$ppoAvg <- ave(df$ppo, df$State, df$Year, FUN = mean)
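
For anyone who wants to run the base R version end to end, here is a minimal sketch. It rebuilds the sample data from the question as a plain data.frame (the construction is mine, not the asker's code) and names the new column annualAvg as in the question rather than ppoAvg. The aggregate() call at the end is an extra, shown only for contrast: it returns one row per Year/State group instead of one value per row.

```r
# Sample data from the question, rebuilt by hand as a plain data.frame
df <- data.frame(
  Year  = c(2011L, 2011L, 2012L, 2011L, 2011L),
  Month = c("Jan", "Feb", "Jan", "Jan", "Feb"),
  State = c("CA", "CA", "CA", "WA", "WA"),
  ppo   = c(220L, 250L, 230L, 200L, 210L)
)

# ave() returns a vector as long as its first argument; each element is
# FUN applied to the State x Year group that the row belongs to
df$annualAvg <- ave(df$ppo, df$State, df$Year, FUN = mean)

# For contrast: one row per (Year, State) group rather than one value per row
annual <- aggregate(ppo ~ Year + State, data = df, FUN = mean)

print(df)      # annualAvg column: 235 235 230 205 205
print(annual)
```

Because ave() keeps the original row order and length, the result can be assigned straight back as a column, with no merge step needed.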

Getting the mean of a variable subset of data in R
