Combining three factor levels so that their dependent variable is summed in R

Time: 2018-09-26 13:43:57

Tags: r

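I'm not sure whether someone has already answered this. I searched, but so far nothing has worked for me. I have a very large data set that I want to narrow down. I need to combine three factor levels of my "PROG" variable ("Grad.2", "Grad.3", "Grad.H") into a single level ("Grad"), with the dependent variable ("NUMBER") summed over each set of otherwise matching values.

i.e.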

YEAR = "92/93"    AGE = "20-24"   PROG = "Grad.2"   NUMBER = "50"

YEAR = "92/93"    AGE = "20-24"   PROG = "Grad.3"   NUMBER = "25"

YEAR = "92/93"    AGE = "20-24"   PROG = "Grad.H"   NUMBER = "2"

becomes

YEAR = "92/93"    AGE = "20-24"   PROG = "Grad"   NUMBER = "77"

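Then I want to drop all the other levels of PROG, so that I can compare enrolment rates for Grad without worrying about the others (I will deal with those separately). So my active independent variables are YEAR and AGE, and the dependent variable is NUMBER.

I hope this shows my data adequately: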

structure(list
(YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L), .Label = c("92/93", "93/94", "94/95", "95/96", "96/97", 
    "97/98", "98/99", "99/00", "00/01", "01/02", "02/03", "03/04", 
    "04/05", "05/06", "06/07", "07/08", "08/09", "09/10", "10/11", 
    "11/12", "12/13", "13/14", "14/15", "15/16"), class = "factor"), 
AGE = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L), .Label = c("1-19", 
            "20-24", "25-30", "31-34", "35-39", "40+", "NR", "T.Age"), class = c("ordered", 
            "factor")), 
PROG = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                19L, 19L, 19L), .Label = c("T.Prog", "Basic", "Career", "Grad.H", 
                "Grad2", "Grad3", "Grad2.Qual", "Grad3.Qual", "Health.Res", 
                "NoProg.Grad", "NoProg.Other", "NoProg.Und.Grad", "NoProg.NoCred", 
                "Other", "Post.Und.Grad", "Post.Career", "Pre-U", "Career.Qual", 
                "Und.Grad", "Und.Grad.Qual"), class = "factor"), 
NUMBER = c(104997L, 
                347235L, 112644L, 38838L, 35949L, 50598L, 5484L, 104991L, 
                333807L, 76692L)), row.names = c(7936L, 7948L, 7960L, 7972L, 
            7984L, 7996L, 8008L, 10459L, 10471L, 10483L), class = "data.frame")

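As for why I am using factors: I did not know how else to enter the data. Factors make sense, and that is how R interpreted the raw data when I loaded it.

I am working through the suggestions below. No success yet, but I am still learning how to get R to do what I want, and I mess up often. I will reply to everyone as soon as I have a sensible answer. (Once I stop banging my poor head on the desk... sigh)

4 answers:

Answer 0: (score: 0)

If I understand your question correctly, this should do it. I am assuming your data frame is named df.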

library(tidyverse)

df %>%
  mutate(PROG = ifelse(PROG %in% c("Grad2", "Grad3", "Grad.H"),
                       "Grad",
                       NA)) %>%   ## combines the 3 Grad levels into one
  filter(!is.na(PROG)) %>%       ## drops the other levels
  group_by(YEAR, AGE) %>%
  summarise(NUMBER = sum(NUMBER))

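Answer 1: (score: 0)

A slightly different approach: keep only the levels you need, drop the factor variable (since you want to treat them as one group), and sum all NUMBER values while grouping by all the other variables. df is your data.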

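# The formula is passed unnamed so the call does not depend on whether the
# formula method's first argument is called `formula` (older R) or `x` (newer R).
aggregate(NUMBER ~ .,
          data = subset(df, PROG %in% c("Grad2", "Grad3", "Grad.H"), select = -PROG),
          FUN = sum)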
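To check the idea against the example in the question, here is a minimal base-R sketch on toy data. The data frame `toy` and the extra "Basic" row with NUMBER 10 are made up for illustration; the level names follow the dput output (Grad2, Grad3, Grad.H) rather than the spelling in the question text.

```r
# Toy data mirroring the three example rows from the question, plus one
# hypothetical non-Grad row that should be dropped.
toy <- data.frame(
  YEAR   = factor(c("92/93", "92/93", "92/93", "92/93")),
  AGE    = factor(c("20-24", "20-24", "20-24", "20-24")),
  PROG   = factor(c("Grad2", "Grad3", "Grad.H", "Basic")),
  NUMBER = c(50L, 25L, 2L, 10L)
)

grad_levels <- c("Grad2", "Grad3", "Grad.H")

# Keep only the Grad rows, drop PROG, and sum NUMBER within each YEAR/AGE cell.
grad <- aggregate(NUMBER ~ YEAR + AGE,
                  data = subset(toy, PROG %in% grad_levels, select = -PROG),
                  FUN = sum)

# Put the combined label back so the result matches the layout in the question.
grad$PROG <- "Grad"
grad
```

The three Grad rows collapse into a single row whose NUMBER is 50 + 25 + 2 = 77, and the "Basic" row is gone.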

Answer 2: (score: -1)

I think the levels() function is exactly what you are looking for. From the manual:

## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z

I named your data temp and ran this code. It worked for me.

# One element per level of PROG, in level order (20 levels, so n = 20 and k = 1)
z <- gl(n = nlevels(temp$PROG), k = 1, labels = levels(temp$PROG))
z
# Levels 4-8 (Grad.H, Grad2, Grad3, Grad2.Qual, Grad3.Qual) become "Grad", the rest "Other"
levels(z) <- c(rep("Other", 3), rep("Grad", 5), rep("Other", 12))
z
# Relabel PROG with the collapsed names; duplicated labels need R >= 3.4.0
temp$PROG2 <- factor(x = temp$PROG, levels = levels(temp$PROG), labels = z)
temp
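As a possibly simpler variant of the same idea, levels<- also accepts a named list that maps new level names to the old levels they should absorb. The sketch below uses made-up vectors `prog` and `number` rather than the asker's data, and relies on the fact that old levels not mentioned in the list become NA, which makes dropping the other programmes easy.

```r
# Hypothetical toy vectors standing in for temp$PROG and temp$NUMBER.
prog   <- factor(c("T.Prog", "Grad2", "Grad3", "Grad.H", "Und.Grad"))
number <- c(100L, 50L, 25L, 2L, 40L)

# Named-list form: the name is the new level, the values are the old levels
# to merge into it. Old levels that are not listed turn into NA.
prog2 <- prog
levels(prog2) <- list(Grad = c("Grad2", "Grad3", "Grad.H"))

# Keep only the rows that ended up in the combined "Grad" level and sum them.
keep  <- !is.na(prog2)
total <- sum(number[keep])
total
```

Note that this version only merges the three levels named in the question; the code above also folds Grad2.Qual and Grad3.Qual into "Grad", so pick whichever set matches what you actually want to count.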

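Answer 3: (score: -1)

There are several ways to do this, but I agree with FScott that you are probably looking for the levels() function to rename the factor levels. Here is a way to do the second step, the summing.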

library(magrittr)
library(dplyr)

#do the renaming of the PROG variables here

#sum by PROG
df <- df %>%
   group_by(PROG) %>%  # you could add more variable names here to group by i.e. group_by(PROG, AGE, YEAR)
   mutate(group.sum= sum(NUMBER))

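This block creates a new column in df called group.sum, which holds the sum of NUMBER within each subgroup defined by group_by().

If you want to condense the data.frame further, so that the individual values in NUMBER are replaced by group.sum, there are many ways to do it, but here is a simple one.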

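#condense df down
df$NUMBER <- df$group.sum   # overwrite the per-row values with the group sums
df$group.sum <- NULL        # drop the helper column by name, not by position
df <- unique(df)            # collapse rows that are now identical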

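Side note: I would not recommend doing the above, because you lose information from your data, and your data is still tidy with just the extra group.sum column.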
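To make the difference between the two outcomes concrete, here is a small base-R sketch on made-up data (the data frame `toy` and its values are assumptions, and ave() and aggregate() stand in for the dplyr verbs used above). The first step keeps every row and adds the group sum alongside it; the second returns one row per group.

```r
# Hypothetical toy data in which PROG has already been renamed.
toy <- data.frame(
  PROG   = c("Grad", "Grad", "Grad", "Und.Grad"),
  NUMBER = c(50L, 25L, 2L, 40L)
)

# Comparable to group_by() + mutate(): every row is kept and the sum for its
# group is repeated on each row.
toy$group.sum <- ave(toy$NUMBER, toy$PROG, FUN = sum)

# Comparable to group_by() + summarise(): one row per group, and the
# individual NUMBER values are no longer available.
condensed <- aggregate(NUMBER ~ PROG, data = toy, FUN = sum)

toy
condensed
```

Keeping `toy` with the extra column preserves the original counts, which is the point of the side note; `condensed` is smaller but cannot be expanded back.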