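I am trying to make some heatmaps from the results of a survey about a local infrastructure project. The survey asked people to predict the project's main cost and main benefit. I have used ggplot to make a simple heatmap of costs against benefits. Now I would like to add a new column, "Frequency2", to the dataset (see below) that is normalized by the category total for each item in the "Costs" column. So the first four entries of "Frequency2" should be the corresponding entries of "Frequency" divided by the total number of people who said housing prices were the main cost (61), then multiplied by 100 to give a percentage. Is there a quick way to do this in R? In Excel I would use sumif to get the category totals and then an if statement to build the new column. Is there a similar process in R? Thanks!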
   Benefits   Costs          Frequency
14 Local Comp Housing Prices         8
16 Jobs       Housing Prices        26
17 Other      Housing Prices         0
18 None       Housing Prices        27
20 Local Comp Traffic                7
22 Jobs       Traffic               17
23 Other      Traffic                1
24 None       Traffic               11
Data
df <- data.frame(Benefits = c("Local Comp", "Jobs", "Other", "None", "Local Comp", "Jobs", "Other", "None"),
                 Costs = c("Housing Prices", "Housing Prices", "Housing Prices", "Housing Prices", "Traffic", "Traffic", "Traffic", "Traffic"),
                 Frequency = c(8, 26, 0, 27, 7, 17, 1, 11))
Answer 0 (score: 3)
You can use ave to compute the sum of Frequency within each group. I do this inside transform:
transform(df, Frequency2 = Frequency / ave(Frequency, Costs, FUN = sum) * 100)
# Benefits Costs Frequency Frequency2
#14 Local_Comp Housing_Prices 8 13.114754
#16 Jobs Housing_Prices 26 42.622951
#17 Other Housing_Prices 0 0.000000
#18 None Housing_Prices 27 44.262295
#20 Local_Comp Traffic 7 19.444444
#22 Jobs Traffic 17 47.222222
#23 Other Traffic 1 2.777778
#24 None Traffic 11 30.555556
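For readers coming from Excel, the same result can be built in two explicit steps that mirror the sumif-then-divide workflow described in the question. This is only a sketch in base R: the names totals and category_total are introduced here for illustration and are not part of the original answer.

```r
# Rebuild the example data from the question
df <- data.frame(
  Benefits  = c("Local Comp", "Jobs", "Other", "None",
                "Local Comp", "Jobs", "Other", "None"),
  Costs     = rep(c("Housing Prices", "Traffic"), each = 4),
  Frequency = c(8, 26, 0, 27, 7, 17, 1, 11)
)

# Step 1 (the "sumif"): total Frequency for each Costs category
totals <- tapply(df$Frequency, df$Costs, sum)

# Step 2: look up each row's category total by name,
# dropping names/dims so the result is a plain numeric vector
category_total <- as.vector(totals[as.character(df$Costs)])

# Step 3: convert each count to a percentage of its category total
df$Frequency2 <- df$Frequency / category_total * 100

print(totals)
print(df)
```

The ave call above does steps 1 and 2 at once: it returns a vector the same length as Frequency in which every row already holds its own group's total.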
Or, if you have a very large dataset, you can use dplyr for better performance:
library(dplyr)
df %>% group_by(Costs) %>% mutate(Frequency2 = Frequency / sum(Frequency) * 100)
#Source: local data frame [8 x 4]
#Groups: Costs
#
# Benefits Costs Frequency Frequency2
#1 Local_Comp Housing_Prices 8 13.114754
#2 Jobs Housing_Prices 26 42.622951
#3 Other Housing_Prices 0 0.000000
#4 None Housing_Prices 27 44.262295
#5 Local_Comp Traffic 7 19.444444
#6 Jobs Traffic 17 47.222222
#7 Other Traffic 1 2.777778
#8 None Traffic 11 30.555556
Or use data.table:
library(data.table)
setDT(df)[, Frequency2 := Frequency / sum(Frequency) * 100, by = Costs ]
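Whichever approach you choose, a quick sanity check is that the percentages within each Costs category add up to 100. The sketch below uses base R only and recomputes Frequency2 with ave so that it runs on its own; group_sums is a name made up for this example. One caveat to keep in mind: a category whose total is zero would produce NaN (0/0) rather than a percentage, so this check assumes every category has at least one response.

```r
# Example data from the question (only the columns needed for the check)
df <- data.frame(
  Costs     = rep(c("Housing Prices", "Traffic"), each = 4),
  Frequency = c(8, 26, 0, 27, 7, 17, 1, 11)
)

# Percentage of each row within its Costs category
df$Frequency2 <- df$Frequency / ave(df$Frequency, df$Costs, FUN = sum) * 100

# Sum the percentages per category; each should be 100 up to rounding error
group_sums <- tapply(df$Frequency2, df$Costs, sum)
print(group_sums)

# all.equal compares with a numeric tolerance instead of exact equality
stopifnot(isTRUE(all.equal(as.vector(group_sums), c(100, 100))))
```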