Making R recognize variable pairs for cut()

Time: 2013-04-23 09:45:34

Tags: r

I have the following data frame:

varnames<-c("ID", "a.1", "b.1", "c.1", "a.2", "b.2", "c.2")

a <-matrix (c(1,2,3,4, 5, 6, 7), 2,7)

colnames (a)<-varnames

df<-as.data.frame (a)


   ID  a.1  b.1  c.1  a.2  b.2  c.2
 1  1    3    5    7    2    4    6
 2  2    4    6    1    3    5    7

I want to categorize the columns "a.2", "b.2" and "c.2" using the quartiles of "a.1", "b.1" and "c.1", respectively:

cat.a.2<-cut(df$a.2, c(-Inf, quantile(df$a.1), Inf))#categorizing a.2 using quartiles of a.1

cat.a.2
[1] (-Inf,3] (-Inf,3]
Levels: (-Inf,3] (3,3.25] (3.25,3.5] (3.5,3.75] (3.75,4] (4, Inf]

cat.b.2<-cut(df$b.2, c(-Inf, quantile(df$b.1), Inf))# categorizing b.2 using quartiles of b.1

cat.b.2
[1] (-Inf,5] (-Inf,5]
Levels: (-Inf,5] (5,5.25] (5.25,5.5] (5.5,5.75] (5.75,6] (6, Inf]


cat.c.2<-cut(df$c.2, c(-Inf, quantile(df$c.1), Inf))# categorizing c.2 using quartiles of c.1

 cat.c.2
[1] (5.5,7] (5.5,7]
Levels: (-Inf,1] (1,2.5] (2.5,4] (4,5.5] (5.5,7] (7, Inf]

Is there a way to automate this task?

I naively tried sapply():

quant.vars<-c("a.1","b.1", "c.1") # names of the variables whose quartiles I am going to use
vars<-c("a.2","b.2", "c.2") # names of the variables I am going to categorize
sapply (vars,FUN=function (x){cut (df [,x], quantile (df[,quant.vars], na.rm=T))})
         a.2        b.2          c.2       
[1,] "(1,3.25]" "(3.25,4.5]" "(5.75,7]"
[2,] "(1,3.25]" "(4.5,5.75]" "(5.75,7]"

Of course, this is not the result I wanted.
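The breaks in the output above (1, 3.25, 4.5, 5.75, 7) are consistent with the quartiles being computed over all three reference columns pooled together, rather than one column per pair. A small sketch to check that reading, with the data frame rebuilt from the printed values and `unlist()` used here to make the pooling explicit (my assumption about what the call effectively did):

```r
# Same values as the data frame printed in the question
df <- data.frame(ID = 1:2,
                 a.1 = c(3, 4), b.1 = c(5, 6), c.1 = c(7, 1),
                 a.2 = c(2, 3), b.2 = c(4, 5), c.2 = c(6, 7))
quant.vars <- c("a.1", "b.1", "c.1")

# Quartiles of the six reference values taken together
pooled <- quantile(unlist(df[, quant.vars]))

# Quartiles of a single reference column, which is what each pair needs
per.column <- quantile(df$a.1)
```

Here `pooled` holds one set of breaks shared by every column, while `per.column` holds the breaks for a.1 only.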

In addition, when I add "Inf" to the cut() range, I get the following error:

  

sapply(vars,FUN = function(x){cut(df [,x],c(quantile(df [,quant.vars],Inf),na.rm = T))})

  Error in quantile.default(df[, quant.vars], Inf) : 'probs' outside [0,1]
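The error comes from argument placement: `Inf` is the second positional argument of quantile(), so it is matched to `probs`, and `na.rm = T` ends up inside c() rather than inside quantile(). A minimal sketch of what was presumably intended, shown for one pair only and with the data rebuilt from the printed values:

```r
df <- data.frame(ID = 1:2, a.1 = c(3, 4), a.2 = c(2, 3))

# Inf passed positionally lands in `probs`, which must lie in [0, 1]
msg <- tryCatch(quantile(df$a.1, Inf), error = function(e) conditionMessage(e))

# Presumed intent: pad the quartile breaks with -Inf and Inf inside c(),
# and pass na.rm to quantile() itself
breaks <- c(-Inf, quantile(df$a.1, na.rm = TRUE), Inf)
cat.a.2 <- cut(df$a.2, breaks)
```

With the infinite end points in place, values of a.2 that fall outside the range of a.1 still get a category instead of NA.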

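In summary, my question is how to make R:

  1. Compute the quantiles of the variables with suffix 1 (a.1, b.1, c.1)

  2. Recognize pairs of variables that share a common prefix (a.1 and a.2, b.1 and b.2, c.1 and c.2)

  3. Within each pair, categorize the variable with suffix 2 using the quantiles obtained from the variable with suffix 1 (a.2 by the quantiles of a.1, b.2 by those of b.1, c.2 by those of c.1)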

Thank you very much.
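The steps listed above could be sketched roughly as follows. This is only one way to do it, and it assumes every column other than ID is named prefix.suffix with suffixes "1" and "2". The names `prefix`, `suffix` and `paired` are mine, not from the question:

```r
df <- data.frame(ID = 1:2,
                 a.1 = c(3, 4), b.1 = c(5, 6), c.1 = c(7, 1),
                 a.2 = c(2, 3), b.2 = c(4, 5), c.2 = c(6, 7))

vars   <- names(df)[-1]                  # everything except ID
prefix <- sub("\\.[^.]*$", "", vars)     # part before the last dot
suffix <- sub("^.*\\.", "", vars)        # part after the last dot

# Step 2: prefixes that occur with both suffix 1 and suffix 2
paired <- intersect(prefix[suffix == "1"], prefix[suffix == "2"])

# Steps 1 and 3: quartiles of the ".1" column become breaks for the ".2" column
res <- lapply(paired, function(p) {
  breaks <- c(-Inf, quantile(df[[paste0(p, ".1")]], na.rm = TRUE), Inf)
  cut(df[[paste0(p, ".2")]], breaks)
})
names(res) <- paste0("cat.", paired, ".2")
```

One caveat: cut() stops with a "'breaks' are not unique" error when a reference column has tied quartiles (for example, a constant column), so real data may need unique() around the breaks or some other handling.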

1 Answer:

Answer 0: (score: 3)

Something like this?

#find duplicated letters
temp <- do.call(rbind,strsplit(names(df)[-1],".",fixed=TRUE))
dup.temp <- temp[duplicated(temp[,1]),]

#loop over the shared prefixes: cut each ".2" column at the quartiles of its ".1" column
res <- lapply(dup.temp[,1],function(i) {
  breaks <- c(-Inf,quantile(df[,paste(i,1,sep=".")]),Inf)
  cut(df[,paste(i,2,sep=".")],breaks)
})

#make list a data.frame
res <- do.call(cbind.data.frame,res)
names(res) <- paste("cut",dup.temp[,1],2,sep=".")

#    cut.a.2  cut.b.2 cut.c.2
# 1 (-Inf,3] (-Inf,5] (5.5,7]
# 2 (-Inf,3] (-Inf,5] (5.5,7]

res[,1]
# [1] (-Inf,3] (-Inf,3]
# Levels: (-Inf,3] (3,3.25] (3.25,3.5] (3.5,3.75] (3.75,4] (4, Inf]

If speed is an issue, there is room for optimization.
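One possible direction, not benchmarked here and not part of the answer above: if integer bin codes are enough and the factor labels are not needed, findInterval() with `left.open = TRUE` uses the same (lo, hi] intervals as cut() and skips building labels. The sketch assumes the prefixes are already known and stored in a hypothetical vector `prefixes`:

```r
df <- data.frame(ID = 1:2,
                 a.1 = c(3, 4), b.1 = c(5, 6), c.1 = c(7, 1),
                 a.2 = c(2, 3), b.2 = c(4, 5), c.2 = c(6, 7))
prefixes <- c("a", "b", "c")   # assumed known; could come from the column names

# One integer column of bin codes per prefix
codes <- vapply(prefixes, function(p) {
  breaks <- c(-Inf, quantile(df[[paste0(p, ".1")]], na.rm = TRUE), Inf)
  findInterval(df[[paste0(p, ".2")]], breaks, left.open = TRUE)
}, integer(nrow(df)))
```

Each code is the position of the interval the value falls in, so it corresponds to the level index that cut() would assign with the same breaks.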