R: dplyr summarise, summing a value only once per unique name

Time: 2015-08-20 12:12:02

Tags: r unique dplyr summary

I've run into an annoying problem with a summary I'm trying to compute using the dplyr package. It is easiest to explain with some example data:

example_data <- structure(list(Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L), 
    Name = structure(c(3L, 3L, 4L, 3L, 2L, 3L, 2L, 4L, 1L), .Label = c("George", 
    "Jack", "John", "Mary"), class = "factor"), Birth.Year = c(1995L, 
    1995L, 1997L, 1995L, 1999L, 1995L, 1999L, 1997L, 1997L), 
    Special_Balance = c(10L, 40L, 30L, 5L, 10L, 15L, 2L, 1L, 
    100L), Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L, 
    20L, 100L, 1600L)), .Names = c("Date", "Name", "Birth.Year", 
"Special_Balance", "Total_Balance"), class = "data.frame", row.names = c(NA, 
-9L))

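My goal is two simple summaries. First, I want to summarise by Date, using the code shown below. The incorrect part is the total_balance_sum calculation, where I want to sum each person's Total_Balance, but counting each person only once. For example, my command gives total_balance_sum=100 for Date=1, but it should be 150 (John's Total_Balance of 100, counted once, plus Mary's Total_Balance of 50). This wrong calculation obviously throws off the final total_pct calculation.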

example_data %>% 
  group_by(Date) %>% 
  summarise(
    total_people=n_distinct(Name),
    total_loan_exposures=n(),

    special_sum=sum(Special_Balance,na.rm=TRUE),
    total_balance_sum=sum(Total_Balance[n_distinct(Name)]), 
    total_pct=special_sum/total_balance_sum

  ) -> example_summary

In the second summary (below), I group by Date and Birth.Year, and again compute total_balance_sum incorrectly:

example_data %>% 
  group_by(Date,Birth.Year) %>% 
  summarise(
    total_people=n_distinct(Name),
    total_loan_exposures=n(),

    special_sum=sum(Special_Balance,na.rm=TRUE),
    total_balance_sum=sum(Total_Balance[n_distinct(Name)]), 
    total_pct=special_sum/total_balance_sum

  ) -> example_summary_birthyear

What is the correct way to achieve this? Clearly, the n_distinct I'm using just picks one of the values rather than properly summing them across names.

Thanks for your help.
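To make the symptom concrete, here is a minimal base R sketch of what the expression in the question evaluates to for the Date=1 rows. It is an illustration only: the vectors are copied by hand from the example data, the helper names k, picked and wrong_sum are made up for this sketch, and length(unique(...)) stands in for n_distinct(), which gives the same count here.

```r
# Date = 1 rows of the example data, copied by hand
Name <- c("John", "John", "Mary")
Total_Balance <- c(100L, 100L, 50L)

# Stand-in for n_distinct(Name): the number of distinct names, here 2
k <- length(unique(Name))

# Total_Balance[k] is therefore Total_Balance[2]: a single element, not one value per person
picked <- Total_Balance[k]

# Summing a single element just returns it
wrong_sum <- sum(picked)
print(wrong_sum)
```

In other words, the count of distinct names is being used as a row position, so the result is whatever balance happens to sit in that row (100 here) rather than the intended 150.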

2 Answers:

Answer 0: (score: 2)

I'm not entirely clear on what you're asking, but does this do what you want? (For the first example only.)

example_data %>% 
  group_by(Date, Name) %>% 
    summarise(
      total_loan_exposures=n(),
      total_SpecialPerson=sum(Special_Balance,na.rm=TRUE),
      total_balance_sumPerson=Total_Balance[1])%>% 
  ungroup() %>% 
  group_by(Date) %>% 
  summarise(
    total_people=n(),
    total_loan_exposures=sum(total_loan_exposures),
    special_sum=sum(total_SpecialPerson,na.rm=TRUE),
    total_balance_sum=sum(total_balance_sumPerson)) %>% 
  mutate(total_pct=(special_sum/total_balance_sum))-> example_summary

> example_summary
Source: local data frame [3 x 6]

  Date total_people total_loan_exposures special_sum total_balance_sum  total_pct
1    1            2                    3          80               150 0.53333333
2    2            2                    4          32               220 0.14545455
3    3            2                    2         101              1700 0.05941176
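The same totals can probably be obtained in a single summarise() step by keeping only the first row of each person inside the group. The following is a sketch rather than the answerer's method: it rebuilds the example data with Name as a plain character column, uses base R's duplicated(), and assumes that a person's Total_Balance is identical on every row within a group (true for the example data). The name one_step_summary is made up for this sketch.

```r
library(dplyr)

# Example data from the question, with Name as a character column
example_data <- data.frame(
  Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  Name = c("John", "John", "Mary", "John", "Jack", "John", "Jack", "Mary", "George"),
  Birth.Year = c(1995L, 1995L, 1997L, 1995L, 1999L, 1995L, 1999L, 1997L, 1997L),
  Special_Balance = c(10L, 40L, 30L, 5L, 10L, 15L, 2L, 1L, 100L),
  Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L, 20L, 100L, 1600L)
)

one_step_summary <- example_data %>%
  group_by(Date) %>%
  summarise(
    total_people = n_distinct(Name),
    total_loan_exposures = n(),
    special_sum = sum(Special_Balance, na.rm = TRUE),
    # !duplicated(Name) is TRUE only for the first row of each person in the group.
    # Assumption: a person's Total_Balance is the same on all of their rows in the group.
    total_balance_sum = sum(Total_Balance[!duplicated(Name)]),
    total_pct = special_sum / total_balance_sum
  )

print(one_step_summary)
```

For the example data this yields total_balance_sum values of 150, 220 and 1700 for Dates 1, 2 and 3, matching the output above. If a person's balance could differ between rows of the same group, this shortcut would silently keep only the first one, so the two-stage grouping may be the safer choice in that case.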

Answer 1: (score: 1)

For the second example (for the first example, just remove Birth.Year):

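library(dplyr)
example_data %>% group_by(Date, Birth.Year) %>%
                 # attach the per-group totals to every row before de-duplicating
                 mutate(special_sum = sum(Special_Balance),
                        total_loan_exposure = n()) %>%
                 # one row per person; in current dplyr, .keep_all = TRUE is needed to retain the columns added above
                 distinct(Name, Total_Balance, .keep_all = TRUE) %>%
                 summarise(Total_balance_sum = sum(Total_Balance),
                           special_sum = special_sum[1],
                           total_people = n(),
                           total_loan_exposure = total_loan_exposure[1],
                           total_pct = special_sum/Total_balance_sum)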
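As a cross-check of the Date and Birth.Year grouping, the de-duplicated balance totals can also be computed in base R without dplyr. This is only a sketch under the same assumption as before (one Total_Balance value per person within a group); per_person and check are names made up for this illustration.

```r
# Example data from the question, with Name as a character column
example_data <- data.frame(
  Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  Name = c("John", "John", "Mary", "John", "Jack", "John", "Jack", "Mary", "George"),
  Birth.Year = c(1995L, 1995L, 1997L, 1995L, 1999L, 1995L, 1999L, 1997L, 1997L),
  Special_Balance = c(10L, 40L, 30L, 5L, 10L, 15L, 2L, 1L, 100L),
  Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L, 20L, 100L, 1600L)
)

# Keep one row per distinct (Date, Birth.Year, Name, Total_Balance) combination
per_person <- unique(example_data[c("Date", "Birth.Year", "Name", "Total_Balance")])

# Sum the de-duplicated balances within each Date / Birth.Year group
check <- aggregate(Total_Balance ~ Date + Birth.Year, data = per_person, FUN = sum)

# aggregate() does not return rows in Date-first order, so sort explicitly
check <- check[order(check$Date, check$Birth.Year), ]
print(check, row.names = FALSE)
```

For the example data the totals come out as 100, 50, 200, 20 and 1700 for the groups (1, 1995), (1, 1997), (2, 1995), (2, 1999) and (3, 1997), which is what the Total_balance_sum column of the dplyr pipeline should show as well.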