Find the category from reference values stored in columns in R

Asked: 2014-12-21 10:28:55

Tags: r

I have the following data and code:

> dput(mydata)
structure(list(P3 = c(99.4, 105.8, 111.9), P5 = c(100.4, 106.9, 
113.1), P10 = c(102, 108.6, 114.9), P25 = c(104.8, 111.6, 118.1
), P50 = c(108, 115, 121.8), P75 = c(111.2, 118.6, 125.6), P90 = c(114.3, 
121.9, 129.1), P95 = c(116.1, 123.9, 131.3), P97 = c(117.4, 125.3, 
132.7), val = c(115.5, 112.7, 117)), .Names = c("P3", "P5", "P10", 
"P25", "P50", "P75", "P90", "P95", "P97", "val"), row.names = 7:9, class = "data.frame")
> 
> mydata
     P3    P5   P10   P25   P50   P75   P90   P95   P97   val
7  99.4 100.4 102.0 104.8 108.0 111.2 114.3 116.1 117.4 115.5
8 105.8 106.9 108.6 111.6 115.0 118.6 121.9 123.9 125.3 112.7
9 111.9 113.1 114.9 118.1 121.8 125.6 129.1 131.3 132.7 117.0

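I want to create a new column 'categ' in mydata that holds the numeric part of the name of the first column (checking from left to right) whose value is greater than 'val' in that row.

So I should get 95, 50, 25 in the new column.

I know that the 'findInterval' and 'match' functions are used for this kind of categorization, but I could not apply them to mydata. Thanks for your help.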

2 answers:

Answer 0 (score: 3)

You could try

indx <- max.col(mydata[,-10] >mydata$val,'first')
mydata$categ <- as.numeric(sub("[A-Z]+", "", names(mydata)[indx]))
mydata$categ
#[1] 95 50 25

Or

indx <- apply(mydata[,-10] > mydata$val, 1, function(x) names(which(x))[1])

Then apply sub as before, this time directly to indx, which already holds the column names.
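For completeness, here is a minimal sketch of the second variant carried through to the final column. It assumes the data from the question; the helper names `pct` and `first_name` are only illustrative and do not appear in the original answer.

```r
# Data from the question, rebuilt with data.frame() for brevity
mydata <- data.frame(
  P3  = c(99.4, 105.8, 111.9),  P5  = c(100.4, 106.9, 113.1),
  P10 = c(102, 108.6, 114.9),   P25 = c(104.8, 111.6, 118.1),
  P50 = c(108, 115, 121.8),     P75 = c(111.2, 118.6, 125.6),
  P90 = c(114.3, 121.9, 129.1), P95 = c(116.1, 123.9, 131.3),
  P97 = c(117.4, 125.3, 132.7), val = c(115.5, 112.7, 117),
  row.names = c("7", "8", "9")
)

# Percentile columns only (illustrative name, not from the answer)
pct <- mydata[, names(mydata) != "val"]

# For each row, the name of the first column whose value exceeds val
first_name <- apply(pct > mydata$val, 1, function(x) names(which(x))[1])

# Strip the leading letter(s) and convert what remains to a number
mydata$categ <- as.numeric(sub("[A-Z]+", "", first_name))
mydata$categ
```

Because `first_name` already contains names such as "P95", sub is applied to it directly rather than to `names(mydata)[indx]`.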

Data

mydata <- structure(list(P3 = c(99.4, 105.8, 111.9), P5 = c(100.4, 106.9, 
113.1), P10 = c(102, 108.6, 114.9), P25 = c(104.8, 111.6, 118.1
), P50 = c(108, 115, 121.8), P75 = c(111.2, 118.6, 125.6), P90 = c(114.3, 
121.9, 129.1), P95 = c(116.1, 123.9, 131.3), P97 = c(117.4, 125.3, 
132.7), val = c(115.5, 112.7, 117)), .Names = c("P3", "P5", "P10", 
"P25", "P50", "P75", "P90", "P95", "P97", "val"), class = "data.frame",
row.names = c("7", "8", "9"))
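One edge case may be worth guarding against with the max.col approach: if 'val' is not below any column in a row, every comparison in that row is FALSE, the row is a tie, and the 'first' tie rule picks column 1. The sketch below uses a small made-up data frame (`toy`, not part of the question) and masks such rows with NA. Whether NA is the right result for those rows is an assumption about what the asker would want.

```r
# Hypothetical toy data: row 2 has a val above every percentile column
toy <- data.frame(
  P3  = c(99.4, 99.4),
  P50 = c(108, 108),
  P97 = c(117.4, 117.4),
  val = c(115.5, 120)
)

# Logical matrix: TRUE where the percentile exceeds val in that row
gt <- toy[, c("P3", "P50", "P97")] > toy$val

# Index of the first TRUE per row (an all-FALSE row would give 1)
indx <- max.col(gt, "first")

# Mask rows where no column exceeds val
indx[rowSums(gt) == 0] <- NA

categ <- as.numeric(sub("[A-Z]+", "", colnames(gt)[indx]))
categ
```

The apply variant returns NA for such rows without extra handling, because `names(which(x))[1]` is NA when nothing is TRUE.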

Answer 1 (score: 1)

To answer the question about speed:

# Build a larger data set by doubling mydata ten times (3 * 2^10 rows)
bigdat <- mydata
for (j in 1:10) bigdat <- rbind(bigdat, bigdat)

# First approach: max.col on the logical matrix
frist <- function(mydata) {
    indx <- max.col(mydata[, -10] > mydata$val, 'first')
    mydata$categ <- as.numeric(sub("[A-Z]+", "", names(mydata)[indx]))
}

# Second approach: row-wise apply
sceond <- function(mydata) indx <- apply(mydata[, -10] > mydata$val, 1, function(x) names(which(x))[1])

library(microbenchmark)
microbenchmark(frist(bigdat), sceond(bigdat))

Unit: milliseconds
           expr       min        lq    median        uq      max neval
  frist(bigdat)  5.400829  5.688074  7.166702  7.816168 142.6927   100
 sceond(bigdat) 22.333659 24.442536 25.422791 26.984677 178.7408   100

Edit: following akrun's comment, I added the same regex line to the sceond function, but it does not affect the timings:

sceond <- function(mydata) {
    indx <- apply(mydata[, -10] > mydata$val, 1, function(x) names(which(x))[1])
    # indx already holds column names, so strip the prefix from it directly
    mydata$categ <- as.numeric(sub("[A-Z]+", "", indx))
}
Unit: milliseconds
           expr       min        lq    median        uq       max neval
  frist(bigdat)  5.315901  5.613826  6.940932  7.791208  29.15699   100
 sceond(bigdat) 22.359897 24.588688 25.636795 27.868710 359.79325   100
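Before reading much into the timings, it can help to confirm that both approaches return the same values on the enlarged data. The following is a minimal sketch under that aim. The function names `by_maxcol` and `by_apply` are illustrative stand-ins for the two functions above, rewritten to return the computed vector explicitly.

```r
# Data from the question
mydata <- data.frame(
  P3  = c(99.4, 105.8, 111.9),  P5  = c(100.4, 106.9, 113.1),
  P10 = c(102, 108.6, 114.9),   P25 = c(104.8, 111.6, 118.1),
  P50 = c(108, 115, 121.8),     P75 = c(111.2, 118.6, 125.6),
  P90 = c(114.3, 121.9, 129.1), P95 = c(116.1, 123.9, 131.3),
  P97 = c(117.4, 125.3, 132.7), val = c(115.5, 112.7, 117)
)

# Same doubling scheme as in the benchmark
bigdat <- mydata
for (j in 1:10) bigdat <- rbind(bigdat, bigdat)

# max.col variant (illustrative name)
by_maxcol <- function(d) {
  indx <- max.col(d[, -10] > d$val, "first")
  as.numeric(sub("[A-Z]+", "", names(d)[indx]))
}

# apply variant (illustrative name)
by_apply <- function(d) {
  nm <- apply(d[, -10] > d$val, 1, function(x) names(which(x))[1])
  as.numeric(sub("[A-Z]+", "", nm))
}

res_maxcol <- by_maxcol(bigdat)
res_apply  <- by_apply(bigdat)

# Both approaches should agree element by element
same <- identical(res_maxcol, res_apply)
same
```

The two only agree when every row has at least one column above 'val', which holds for this data.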