Summarising grouped rows with dplyr using select, group_by and mutate

Time: 2019-12-10 23:38:53

Tags: r dplyr

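Question: I am computing a group-level total market share variable for a car market in which 286 distinct car models were sold, 501 cars in total. The group share is based only on the car characteristics cat = "compact", "midsize", "large" and yr = 77, 78, 79, 80, 81; the share s is a small double; there are 15 groups in the market.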

The closest answer I found was posted by mishabalyasin on community.rstudio: "Calculating rowwise totals and proportions using tidyeval?" (link to post on community.rstudio)

Applying the select-split-combine principle is the closest I have come to the right answer. It gives the 15 groups (15 x 3 (cat, yr, s)):

library(dplyr) #for %>%, select(), group_by(), summarise()
df<- blp %>% 
  select(cat,yr,s) %>%
  group_by(cat,yr) %>% 
  summarise(group_share = sum(s))

#in my actual data, this is what fills in the group share row by row to get what I want, but this isn't the desired pipeline-based answer
blp$group_share=0 #initializing the group_share, the 50th col
for(i in 1:501){
  for(j in 1:15){
    if((blp[i,31]==df[j,1])&&(blp[i,3]==df[j,2])){ #if(sameCat & sameYr){blpGS=dfGS}
      blp[i,50]=df[j,3]
      }
  }
}

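This works, but I know it can be done in one go. Hopefully the idea is clear from the description above. A simple fix to the loop, setting the value by conditions on cat and yr, would help, but I am really trying to get better at data wrangling with dplyr, so any insight toward a piped answer along these lines would be great.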

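Example for the site: the example below does not work with the code I provided above, but this is what the data "looks" like. The share column is the problem.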

#45 obs, 3 cats, 5 yrs
cat=c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr=c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s=c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)

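#note: cbind() coerces every column to character, so yr and s are not numeric in blp (they become factors on R versions before 4.0)
blp=as.data.frame(cbind(unlist(lapply(cat,as.character,stringsAsFactors=FALSE)),as.numeric(yr),unlist(as.numeric(s))))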

names(blp)<-c("cat","yr","s")
head(blp)

#note: one example of a group share would be summing the share from
(group_share.blp.large.81.s=(blp[cat== "large" &yr==81,]))

#works thanks to akrun: applying the code I provided for what leads to the 15 groups 
df <- blp %>% 
    select(cat,yr,s) %>%
    group_by(cat,yr) %>% 
    summarise(group_share = sum(as.numeric(as.character(s)))) 
#manual fill with nested loops: the result I'd want if I weren't after a pipeline
blp$group_share=0
for(i in 1:45){
  for(j in 1:15){ #loop over the 15 groups in df
    if((as.character(blp[[i,1]])==as.character(df[[j,1]])) && (as.numeric(as.character(blp[[i,2]]))==as.numeric(as.character(df[[j,2]])))){ #if(sameCat & sameYr){blpGS=dfGS}
      blp[i,4]=df[[j,3]]
    }
  }
}
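
A side note on the as.numeric(as.character(s)) step used above: when a numeric column has ended up as a factor (as can happen with cbind() followed by as.data.frame() on R versions before 4.0), as.numeric() alone returns the internal level codes rather than the values. A minimal sketch, using a made-up factor f purely for illustration:

```r
# f is a hypothetical factor holding shares that were stored as text
f <- factor(c("0.001", "0.0005", "0.002"))

codes  <- as.numeric(f)                # level codes, not the shares
values <- as.numeric(as.character(f))  # the actual shares

print(codes)   # 2 1 3
print(values)  # 0.001 0.0005 0.002
```

If s is already numeric or character, the extra as.character() call is harmless.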

1 answer:

Answer 0: (score: 0)

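If I have understood your question correctly, this should help. The only difference is that mutate keeps the original rows and columns and adds the aggregated column to them, whereas summarise collapses the data to one row per group and keeps only the grouping columns and the aggregated column.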
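
To make that difference concrete, here is a small sketch on a hypothetical four-row data frame (toy, g, x and total are illustrative names, not part of the question's data):

```r
library(dplyr)

# toy is a hypothetical 4-row data frame, not the asker's data
toy <- data.frame(g = c("a", "a", "b", "b"), x = c(1, 2, 3, 4))

# summarise(): one row per group, only the grouping and summary columns remain
collapsed <- toy %>%
  group_by(g) %>%
  summarise(total = sum(x))

# mutate(): every original row and column is kept, the group total is repeated
kept <- toy %>%
  group_by(g) %>%
  mutate(total = sum(x)) %>%
  ungroup()

nrow(collapsed)  # 2
nrow(kept)       # 4
kept$total       # 3 3 7 7
```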

# Sample input
## 45 obs, 3 cats, 5 yrs
cat <- c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")

yr <- c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)

s <- c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)

# Calculation
library(dplyr) # provides %>%, group_by(), mutate(), ungroup()
blp <- 
  data.frame(cat, yr, s, stringsAsFactors = FALSE) %>% # To create dataframe
  group_by(cat, yr) %>% # Grouping by category and year
  mutate(group_share = sum(s, na.rm = TRUE)) %>% # Calculating sum share per category/year 
  ungroup()

Expected output
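
One way to check the result is sketched below. It rebuilds the sample vectors compactly with rep(), which assumes the posted vectors are exact repetitions of the same 3, 5 and 5 values, and compares the grouped mutate() with a summarise() plus left_join() version, which is roughly the piped equivalent of the nested loop in the question. The names sample_df, by_mutate, group_totals and by_join are illustrative.

```r
library(dplyr)

# Compact rebuild of the sample input (assumed equivalent to the posted vectors)
cat <- rep(c("compact", "midsize", "large"), times = 15)
yr  <- rep(c(77, 78, 79, 80, 81), times = 9)
s   <- rep(c(.001, .0005, .002, .0001, .0002), times = 9)
sample_df <- data.frame(cat, yr, s, stringsAsFactors = FALSE)

# Version 1: grouped mutate, as in the answer
by_mutate <- sample_df %>%
  group_by(cat, yr) %>%
  mutate(group_share = sum(s, na.rm = TRUE)) %>%
  ungroup()

# Version 2: summarise to one row per group, then join back onto all 45 rows
group_totals <- sample_df %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(s, na.rm = TRUE)) %>%
  ungroup()
by_join <- left_join(sample_df, group_totals, by = c("cat", "yr"))

nrow(group_totals)                                      # 15 groups
all.equal(by_mutate$group_share, by_join$group_share)   # TRUE
```

On this sample each cat/yr group has three rows with the same s, so for example every large/81 row carries a group_share of 3 x 0.0002 = 0.0006.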