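I am working with a dataset in which I need to sum amounts by year. I also want to create a separate variable that holds the amount summed over just one level of another variable (for example, only the amount donated by the United States). Below are the two steps I currently have to run separately. How can I combine them into one piece of code?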
country year donor amount
china 2000 germany 20
china 2000 france 30
china 2000 united states 40
china 2000 united states 50
china 2001 germany 20
china 2001 france 30
china 2001 united states 40
china 2001 united states 50
china 2002 germany 20
china 2002 france 30
china 2002 united states 40
china 2002 united states 50
# Total amount per country and year
new.data <- old.data %>%
  group_by(country, year) %>%
  summarise(sum.amount = sum(amount))

# United States amount per country and year
new.data <- old.data %>%
  filter(donor == "united states") %>%
  group_by(country, year) %>%
  summarise(us.amount = sum(amount))
Answer 0 (score: 0)
You can use inner_join to join the two queries:
library(dplyr)
new.data = old.data %>%
  group_by(country, year) %>%
  summarise(sum.amount = sum(amount)) %>%
  inner_join(old.data %>%
               filter(donor == "united states") %>%
               group_by(country, year) %>%
               summarise(us.amount = sum(amount)))
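One caveat that may matter on real data: inner_join keeps only the country-year groups that appear in both queries, so a group with no "united states" rows disappears from the result. The sketch below is a hedged variant, not part of the original answer, that uses left_join and coalesce to keep such groups with a US amount of 0. The data frame `toy` and its "india" rows are hypothetical and exist only to illustrate the case.

```r
library(dplyr)

# Hypothetical toy data: "india" has no "united states" rows.
toy <- data.frame(
  country = c("china", "china", "india"),
  year    = c(2000L, 2000L, 2000L),
  donor   = c("germany", "united states", "france"),
  amount  = c(20L, 40L, 30L),
  stringsAsFactors = FALSE
)

# Total amount per country and year.
totals <- toy %>%
  group_by(country, year) %>%
  summarise(sum.amount = sum(amount)) %>%
  ungroup()

# United States amount per country and year.
us <- toy %>%
  filter(donor == "united states") %>%
  group_by(country, year) %>%
  summarise(us.amount = sum(amount)) %>%
  ungroup()

# left_join keeps every row of totals; groups without US rows get NA,
# which coalesce() then replaces with 0.
joined <- totals %>%
  left_join(us, by = c("country", "year")) %>%
  mutate(us.amount = coalesce(us.amount, 0L))
```

Whether 0 or NA is the right value for a group without US donations depends on the analysis; dropping the coalesce step leaves NA in place.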
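Or use mutate for the first aggregation and filter + summarize for the second. For large datasets this second approach should be considerably faster, because it makes only one pass over old.data and reduces its size along the way: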
new.data = old.data %>%
  group_by(country, year) %>%
  mutate(sum.amount = sum(amount)) %>%
  filter(donor == "united states") %>%
  summarize(sum.amount = max(sum.amount),
            us.amount = sum(amount))
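Note:
mutate(sum.amount = sum(amount)) adds a sum.amount column that holds the same value on every row of a given country-year combination. summarize then creates us.amount by summing amount over the United States donor rows within each country-year combination. If I had written only summarize(us.amount = sum(amount)) at this step, the sum.amount column would be dropped. Because I am summarizing by country-year, I also have to wrap sum.amount in a summary function to keep it. max(sum.amount) does the job because every sum.amount value within a country-year combination is identical. For the same reason, min(sum.amount) would work just as well.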
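The claim in the note can be checked by inspecting the intermediate table just before the final summarize. The following is a minimal sketch that rebuilds the sample data from the question as a plain data frame (the names `intermediate` and `check` are illustrative only) and counts the distinct sum.amount values left in each group after the filter.

```r
library(dplyr)

# Sample data from the question, rebuilt as a plain data frame.
old.data <- data.frame(
  country = "china",
  year    = rep(2000:2002, each = 4),
  donor   = rep(c("germany", "france", "united states", "united states"),
                times = 3),
  amount  = rep(c(20L, 30L, 40L, 50L), times = 3),
  stringsAsFactors = FALSE
)

# State of the pipeline after mutate() and filter(), before summarize().
intermediate <- old.data %>%
  group_by(country, year) %>%
  mutate(sum.amount = sum(amount)) %>%
  filter(donor == "united states") %>%
  ungroup()

# Number of distinct sum.amount values per country-year group.
check <- intermediate %>%
  group_by(country, year) %>%
  summarise(n.distinct = n_distinct(sum.amount)) %>%
  ungroup()
```

If every group in `check` reports a single distinct value, max(sum.amount) and min(sum.amount) necessarily return the same number.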
Result:
# A tibble: 3 x 4
# Groups: country [?]
country year sum.amount us.amount
<fctr> <int> <int> <int>
1 china 2000 140 90
2 china 2001 140 90
3 china 2002 140 90
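A further variant, offered as a hedged sketch rather than as part of the answer above: both sums can be computed in a single summarise by subsetting amount with a logical condition. Unlike the filter-based pipeline, this keeps country-year groups that have no "united states" rows and reports 0 for them, since the sum of an empty integer vector is 0. The `toy` data frame and its "india" row are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: "india" has no "united states" rows.
toy <- data.frame(
  country = c("china", "china", "china", "india"),
  year    = c(2000L, 2000L, 2000L, 2000L),
  donor   = c("germany", "united states", "united states", "france"),
  amount  = c(20L, 40L, 50L, 30L),
  stringsAsFactors = FALSE
)

one.pass <- toy %>%
  group_by(country, year) %>%
  summarise(
    sum.amount = sum(amount),                            # all donors
    us.amount  = sum(amount[donor == "united states"])   # US rows only
  ) %>%
  ungroup()
```

The same pattern extends to other donors by adding one more line per donor inside summarise.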
Data:
old.data = structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "china", class = "factor"),
year = c(2000L, 2000L, 2000L, 2000L, 2001L, 2001L, 2001L,
2001L, 2002L, 2002L, 2002L, 2002L), donor = c("germany",
"france", "united states", "united states", "germany", "france",
"united states", "united states", "germany", "france", "united states",
"united states"), amount = c(20L, 30L, 40L, 50L, 20L, 30L,
40L, 50L, 20L, 30L, 40L, 50L)), class = "data.frame", .Names = c("country",
"year", "donor", "amount"), row.names = c(NA, -12L))