New data frame column conditional on an existing column

Asked: 2014-12-22 18:37:47

Tags: r dataframe

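I'm trying to make a few heatmaps of survey results about a local infrastructure project. The survey asked people to predict the project's main cost and its main benefit. I've already used ggplot to make a simple heatmap of the frequency of costs and benefits. Now I'd like to add a new column, "Frequency2", to my data set (see below) that normalizes "Frequency" by the category total for each item in the "Costs" column. For example, the first four entries of "Frequency2" should be the corresponding "Frequency" values divided by the total number of people who named housing prices as the main cost (61), then multiplied by 100 to give a percentage. Is there a quick way to do this in R? In Excel I'd use sumif to get the category totals and then an if statement to build the new column. Is there an analogous process in R? Thanks!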

     Benefits          Costs Frequency
14 Local Comp Housing Prices         8
16       Jobs Housing Prices        26
17      Other Housing Prices         0
18       None Housing Prices        27
20 Local Comp        Traffic         7
22       Jobs        Traffic        17
23      Other        Traffic         1
24       None        Traffic        11

Data

    df <- data.frame(
      Benefits = c("Local Comp", "Jobs", "Other", "None", "Local Comp", "Jobs", "Other", "None"),
      Costs = c("Housing Prices", "Housing Prices", "Housing Prices", "Housing Prices", "Traffic", "Traffic", "Traffic", "Traffic"),
      Frequency = c(8, 26, 0, 27, 7, 17, 1, 11)
    )

1 answer:

Answer 0 (score: 3)

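You can use ave to compute the sum of the frequencies within each group. Here I do it inside transform, which adds the result as a new column: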

transform(df, Frequency2 = Frequency / ave(Frequency, Costs, FUN = sum) * 100)
#     Benefits          Costs Frequency Frequency2
#14 Local_Comp Housing_Prices         8  13.114754
#16       Jobs Housing_Prices        26  42.622951
#17      Other Housing_Prices         0   0.000000
#18       None Housing_Prices        27  44.262295
#20 Local_Comp        Traffic         7  19.444444
#22       Jobs        Traffic        17  47.222222
#23      Other        Traffic         1   2.777778
#24       None        Traffic        11  30.555556
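
To see what ave contributes on its own, here is a minimal sketch that rebuilds the df from the question's Data block so it runs standalone. The names group_total and pct are only illustrative. ave returns a vector as long as its input, with each row carrying its own group's total, which is roughly the role sumif plays in Excel:

```r
# Rebuild the example data from the question (assumed identical to the Data block)
df <- data.frame(
  Benefits  = c("Local Comp", "Jobs", "Other", "None",
                "Local Comp", "Jobs", "Other", "None"),
  Costs     = c(rep("Housing Prices", 4), rep("Traffic", 4)),
  Frequency = c(8, 26, 0, 27, 7, 17, 1, 11)
)

# One total per row, repeated within each Costs group
group_total <- ave(df$Frequency, df$Costs, FUN = sum)
group_total
# [1] 61 61 61 61 36 36 36 36

# Dividing row by row gives each row's share of its category, in percent
pct <- df$Frequency / group_total * 100
```

Because the totals are already aligned with the rows, no separate if step is needed to pick the right denominator.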

Alternatively, if you have a very large data set, you could use dplyr for better performance:

library(dplyr)
df %>% group_by(Costs) %>% mutate(Frequency2 = Frequency / sum(Frequency) * 100)
#Source: local data frame [8 x 4]
#Groups: Costs
#
#    Benefits          Costs Frequency Frequency2
#1 Local_Comp Housing_Prices         8  13.114754
#2       Jobs Housing_Prices        26  42.622951
#3      Other Housing_Prices         0   0.000000
#4       None Housing_Prices        27  44.262295
#5 Local_Comp        Traffic         7  19.444444
#6       Jobs        Traffic        17  47.222222
#7      Other        Traffic         1   2.777778
#8       None        Traffic        11  30.555556

Or use data.table:

library(data.table)
setDT(df)[, Frequency2 := Frequency / sum(Frequency) * 100, by = Costs ]
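
Whichever approach is used, a quick sanity check is that the percentages add up to 100 within each cost category. The sketch below uses only base R and rebuilds the question's data so it is self-contained. The name totals is illustrative, and the tolerance of 1e-8 is an arbitrary choice to allow for floating-point rounding:

```r
# Rebuild the example data from the question (assumed identical to the Data block)
df <- data.frame(
  Benefits  = c("Local Comp", "Jobs", "Other", "None",
                "Local Comp", "Jobs", "Other", "None"),
  Costs     = c(rep("Housing Prices", 4), rep("Traffic", 4)),
  Frequency = c(8, 26, 0, 27, 7, 17, 1, 11)
)

# Same computation as the ave/transform version above
df$Frequency2 <- df$Frequency / ave(df$Frequency, df$Costs, FUN = sum) * 100

# Sum the percentages within each Costs category; each should be 100
totals <- tapply(df$Frequency2, df$Costs, sum)

# Stop with an error if any category is off by more than rounding noise
stopifnot(all(abs(totals - 100) < 1e-8))
```

Note that the data.table line modifies df in place and prints nothing, so print df afterwards to see the new column.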