How do I sum a variable by factor level?

Time: 2017-11-29 18:30:07

Tags: r sum dplyr mutate

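I am working with a dataset in which I need to sum amounts by year. I also want to create a separate variable that is the sum of the amount for just one level of another variable (for example, only the amount from the United States). Below is what I currently have to do as two separate steps. How can I combine these pieces of code?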

country year donor amount
china 2000 germany 20
china 2000 france 30
china 2000 united states 40 
china 2000 united states 50
china 2001 germany 20
china 2001 france 30
china 2001 united states 40 
china 2001 united states 50
china 2002 germany 20
china 2002 france 30
china 2002 united states 40 
china 2002 united states 50

# Total amount per country and year
new.data <- old.data %>%
  group_by(country, year) %>%
  summarise(sum.amount = sum(amount))

# Amount from the United States only, per country and year
new.data <- old.data %>%
  filter(donor == "united states") %>%
  group_by(country, year) %>%
  summarise(us.amount = sum(amount))

1 answer:

Answer 0: (score: 0)

You can use inner_join to join the two queries:

library(dplyr)

new.data = old.data %>%
  group_by(country, year) %>%
  summarise(sum.amount = sum(amount)) %>%
  inner_join(old.data %>%
         filter(donor == "united states") %>%
         group_by(country, year) %>%
         summarise(us.amount = sum(amount)))

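Alternatively, use mutate for the first aggregation and filter + summarize for the second. For large datasets, this second approach should be much faster, because old.data is passed over only once and filter reduces its size before the final summary: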

new.data = old.data %>%
  group_by(country, year) %>%
  mutate(sum.amount = sum(amount)) %>%
  filter(donor == "united states") %>%
  summarize(sum.amount = max(sum.amount), 
            us.amount = sum(amount))

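Notes:

mutate(sum.amount = sum(amount)) gives every row in the same country-year combination the same sum.amount value. summarize then creates us.amount by summing amount within each country-year combination for the rows where donor is the United States. If I had written only summarize(us.amount = sum(amount)) in this step, the sum.amount column would be lost. Because I am summarizing by country-year, I also have to apply a summary function to sum.amount in order to keep it. max(sum.amount) does the job, since all sum.amount values are identical within a country-year combination. min(sum.amount) would work just as well.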
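To make the note above concrete, here is a minimal sketch of the intermediate state after mutate and filter. The data frame demo and the column names via.max, via.min and via.first are made up for illustration (a single country-year taken from the sample data); they are not part of the original answer.

```r
library(dplyr)

# Hypothetical stand-in for one country-year group of old.data
demo <- data.frame(
  country = "china",
  year    = 2000L,
  donor   = c("germany", "france", "united states", "united states"),
  amount  = c(20L, 30L, 40L, 50L),
  stringsAsFactors = FALSE
)

# After mutate + filter, each remaining row still carries the full group total
step <- demo %>%
  group_by(country, year) %>%
  mutate(sum.amount = sum(amount)) %>%
  filter(donor == "united states")

# Any function that collapses identical values returns the same number
check <- step %>%
  summarise(via.max   = max(sum.amount),
            via.min   = min(sum.amount),
            via.first = first(sum.amount),
            us.amount = sum(amount))
```

Here step has two rows, both with sum.amount equal to 140, so max, min and first all return 140, while us.amount is 40 + 50 = 90.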

Result:

# A tibble: 3 x 4
# Groups:   country [?]
  country  year sum.amount us.amount
   <fctr> <int>      <int>     <int>
1   china  2000        140        90
2   china  2001        140        90
3   china  2002        140        90
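As a variation that is not taken from the answer, both columns can also be computed in a single summarise by subsetting amount inside sum. The sketch below uses a small hypothetical data frame, demo, which includes a made-up 2003 group with no United States donor to show one difference: this form keeps such groups with us.amount equal to 0, whereas the filter and inner_join approaches drop them.

```r
library(dplyr)

# Hypothetical data: same columns as old.data, plus a 2003 group with no US donor
demo <- data.frame(
  country = "china",
  year    = c(2000L, 2000L, 2000L, 2003L),
  donor   = c("germany", "united states", "united states", "france"),
  amount  = c(20L, 40L, 50L, 30L),
  stringsAsFactors = FALSE
)

totals <- demo %>%
  group_by(country, year) %>%
  summarise(sum.amount = sum(amount),
            # keep only the US rows of amount before summing;
            # the sum of an empty subset is 0
            us.amount  = sum(amount[donor == "united states"])) %>%
  ungroup()
```

Whether a zero row or a missing row is preferable for country-year combinations without a United States donor depends on how the result is used downstream.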

Data:

old.data = structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = "china", class = "factor"), 
    year = c(2000L, 2000L, 2000L, 2000L, 2001L, 2001L, 2001L, 
    2001L, 2002L, 2002L, 2002L, 2002L), donor = c("germany", 
    "france", "united states", "united states", "germany", "france", 
    "united states", "united states", "germany", "france", "united states", 
    "united states"), amount = c(20L, 30L, 40L, 50L, 20L, 30L, 
    40L, 50L, 20L, 30L, 40L, 50L)), class = "data.frame", .Names = c("country", 
"year", "donor", "amount"), row.names = c(NA, -12L))