Question

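I have a dataset I'm working on and need to sum the amounts by year. I want to create a separate variable that holds only the amount summed for one level of another variable (for example, only the amount from the United States). Below are the operations I currently have to run separately. How can I combine this code?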

country year donor amount
china 2000 germany 20
china 2000 france 30
china 2000 united states 40 
china 2000 united states 50
china 2001 germany 20
china 2001 france 30
china 2001 united states 40 
china 2001 united states 50
china 2002 germany 20
china 2002 france 30
china 2002 united states 40 
china 2002 united states 50

# Total amount per country and year
new.data <- old.data %>%
  group_by(country, year) %>%
  summarise(sum.amount = sum(amount))

# Amount from the United States only, per country and year
new.data <- old.data %>%
  filter(donor == "united states") %>%
  group_by(country, year) %>%
  summarise(us.amount = sum(amount))

Answer 1

You can use inner_join to join the two queries:

library(dplyr)

new.data = old.data %>%
  group_by(country, year) %>%
  summarise(sum.amount = sum(amount)) %>%
  inner_join(old.data %>%
         filter(donor == "united states") %>%
         group_by(country, year) %>%
         summarise(us.amount = sum(amount)))

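Alternatively, use mutate for the first aggregation and filter + summarize for the second. On large datasets the second approach should be considerably faster, because it makes only one pass over old.data and shrinks it along the way: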

new.data = old.data %>%
  group_by(country, year) %>%
  mutate(sum.amount = sum(amount)) %>%
  filter(donor == "united states") %>%
  summarize(sum.amount = max(sum.amount), 
            us.amount = sum(amount))

Note:

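mutate(sum.amount = sum(amount)) adds a sum.amount column whose value is identical for every row within the same country-year combination. summarize then creates us.amount by summing amount within each country-year combination, which at that point contains only the United States donors. If I had written only summarize(us.amount = sum(amount)) in this step, the sum.amount column would be lost. Because I am summarizing by country and year, I also have to apply a summary function to sum.amount in order to keep it. max(sum.amount) does the job, since every sum.amount within a given country-year combination is the same. min(sum.amount) would work equally well.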
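As a possible variation on the approach described above (not part of the original answer), both columns can be computed in a single summarise call by subsetting amount inside the sum. This is only a minimal sketch: it rebuilds old.data by hand with data.frame, using numeric amounts and character columns as an assumption, and it behaves differently from the filter-based versions in one respect, namely that a country-year group with no "united states" rows is kept with us.amount equal to 0 instead of being dropped.

```r
library(dplyr)

# Hand-built copy of the example data (assumed layout: plain character and numeric columns)
old.data <- data.frame(
  country = "china",
  year    = rep(2000:2002, each = 4),
  donor   = rep(c("germany", "france", "united states", "united states"), times = 3),
  amount  = rep(c(20, 30, 40, 50), times = 3),
  stringsAsFactors = FALSE
)

new.data <- old.data %>%
  group_by(country, year) %>%
  summarise(
    sum.amount = sum(amount),                              # total over all donors
    us.amount  = sum(amount[donor == "united states"])     # total over US rows only
  ) %>%
  ungroup()

print(new.data)
```

Because the subsetting happens inside summarise, there is no need to carry sum.amount through with max or min.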

Result:

# A tibble: 3 x 4
# Groups:   country [?]
  country  year sum.amount us.amount
   <fctr> <int>      <int>     <int>
1   china  2000        140        90
2   china  2001        140        90
3   china  2002        140        90

Data:

old.data = structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "china", class = "factor"), year = c(2000L, 2000L, 2000L, 2000L, 2001L, 2001L, 2001L, 2001L, 2002L, 2002L, 2002L, 2002L), donor = c("germany", "france", "united states", "united states", "germany", "france", "united states", "united states", "germany", "france", "united states", "united states"), amount = c(20L, 30L, 40L, 50L, 20L, 30L, 40L, 50L, 20L, 30L, 40L, 50L)), class = "data.frame", .Names = c("country", "year", "donor", "amount"), row.names = c(NA, -12L))
