How can I find the cases that are in the top n percentiles of several variables at once?

Time: 2016-11-12 06:19:11

Tags: r data-manipulation

Imagine we have a data frame like this:

df <- data.frame(x = seq(10, 20), y = seq(8, 18), z = seq(0, 10))

    x  y  z
1  10  8  0
2  11  9  1
3  12 10  2
4  13 11  3
5  14 12  4
6  15 13  5
7  16 14  6
8  17 15  7
9  18 16  8
10 19 17  9
11 20 18 10

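How can we select the cases that fall in the top percentiles on all of x, y and z? I need code that searches for the cases in the top 1% of all variables and, if nothing is found, relaxes the criterion to 2%, then 3%, and so on, until it has found the m cases that rank highest across all variables. We need to be able to set m as required.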

4 Answers:

Answer 0: (score: 1)

I think this should work for you:

df<-data.frame(x=seq(10,20), y=seq(8,18), z=seq(0,10))

# Define the function: df is the input data frame, cases is the 'm' you are looking for,
# startingperc is the percentile level to start from, and tickrate is the step by
# which the percentile is lowered until m cases are found
myfunc <- function(df, cases, startingperc, tickrate){
  found <- 0
  while(found < cases) {
    quants <- apply(df, 2, quantile, probs = startingperc)
    indices <- which(apply(df, 1, function(x) all(x > quants)) == TRUE)
    found <- length(indices)
    if(found < cases) {startingperc <- startingperc - tickrate}
  }
  # added this to handle a tickrate that is too large: if the last step
  # overshoots, keep only the `cases` rows with the largest row sums
  if (length(indices) > cases) {
    indices <- rev(indices[order(apply(df[indices,],1, sum), decreasing = TRUE)[1:cases]])
  }
  return(df[indices,])
}

#in use
myfunc(df, 5, .99, .01)

which gives:

> myfunc(df, 5, .99, .01)
    x  y  z
7  16 14  6
8  17 15  7
9  18 16  8
10 19 17  9
11 20 18 10
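
One caveat, offered as an observation rather than a correction: because the comparison is strict (`>`), the row holding each column's minimum can never qualify, so asking for as many cases as there are rows appears to push `startingperc` below 0, at which point `quantile()` raises an error. Below is a minimal sketch of a guarded variant. The name `top_cases` and its arguments `m`, `start` and `step` are hypothetical, and it uses `>=` instead of `>`, which changes which boundary rows are kept.

```r
df <- data.frame(x = seq(10, 20), y = seq(8, 18), z = seq(0, 10))

# Hypothetical helper: relax the percentile until at least m rows clear
# the cutoff on every column, never letting the percentile drop below 0.
top_cases <- function(df, m, start = 0.99, step = 0.01) {
  stopifnot(m >= 1, m <= nrow(df))
  p <- start
  repeat {
    # One cutoff per column at the current percentile
    cutoffs <- unname(sapply(df, quantile, probs = p))
    # A row is kept only if it reaches the cutoff on all columns
    keep <- apply(df, 1, function(row) all(row >= cutoffs))
    if (sum(keep) >= m || p <= 0) {
      return(df[keep, , drop = FALSE])
    }
    p <- max(p - step, 0)
  }
}

top_cases(df, 5)
top_cases(df, nrow(df))
```

With `>=`, a percentile of 0 admits every row, so the loop always ends once `m` is no larger than the number of rows. Unlike `myfunc`, this sketch does not trim the result when the last step overshoots `m`.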

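Answer 1: (score: 1)

You can do something like this to see how many observations clear each quantile in a sequence. You can modify it to return the indices of those rows instead, and you can change the sequence of quantiles it iterates over.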

lapply(lapply(seq(0.9,0.1,-0.1), function(xx) Reduce(intersect, lapply(df, function(x) which(x>=quantile(x, probs = xx))))), length)

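For each column it finds the observations whose value is greater than or equal to the quantile, then intersects the results across all columns to get the common indices. The outer lapply iterates over the vector of quantiles I supply, and the final step simply takes the length of each set of indices.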
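
As a rough sketch of the modification mentioned above, the same intersection can be kept as row indices and searched for the first quantile that yields enough rows. The names `probs`, `common`, `m` and `first_hit` are illustrative, and `df` is assumed to be the data frame from the question.

```r
df <- data.frame(x = seq(10, 20), y = seq(8, 18), z = seq(0, 10))

# Quantiles to try, from strictest to most relaxed
probs <- seq(0.9, 0.1, -0.1)

# For each quantile, the row indices that clear it on every column
common <- lapply(probs, function(p) {
  Reduce(intersect,
         lapply(df, function(col) which(col >= quantile(col, probs = p))))
})
names(common) <- probs

# First (strictest) quantile that yields at least m rows; NULL if none does
m <- 3
first_hit <- Find(function(idx) length(idx) >= m, common)
df[first_hit, ]
```

Because a lower quantile can only admit more rows, the index sets grow as the sequence relaxes, so the first hit is also the strictest threshold that satisfies `m`.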

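Answer 2: (score: 1)

Although it isn't strictly necessary (you could just find the minimum percentile and use ceiling), this makes a nice example of a recursive function: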

fun <- function(n_rows = 1, pct = 1, dat = df){
    # This part doesn't need to be repeated. Uses dplyr::percent_rank to calculate
    # percentiles over all values pooled together (unlist() makes the pooling
    # explicit), and sums each row of percentiles.
    row_sums <- rowSums(matrix(1 - dplyr::percent_rank(unlist(dat)),
                               ncol = ncol(dat)))
    fun2 <- function(p = pct){    # defines a recursive function
        # calculates if each row is below percentile threshold
        working_rows <- row_sums <= p / 100 * ncol(dat)
        if(sum(working_rows) >= n_rows){    # if enough rows,
            dat[working_rows, ]    # returns them
        } else {
            fun2(p + 1)    # else calls itself, incrementing the threshold
        }
    }
    fun2(pct)    # call recursive function with initial percentile
}

fun()
##     x  y  z
## 11 20 18 10

fun(3)
##     x  y  z
## 9  18 16  8
## 10 19 17  9
## 11 20 18 10

fun(n_rows = 1, pct = 50)
##     x  y  z
## 7  16 14  6
## 8  17 15  7
## 9  18 16  8
## 10 19 17  9
## 11 20 18 10

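Note that this ranks all the values of the columns together as a single group. To rank each column separately, replace the row_sums line with the following (keeping the `1 -` so that high-ranking rows still get small sums):

row_sums <- rowSums(1 - sapply(dat, dplyr::percent_rank))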
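
To make the difference between the two rankings concrete, here is a small base-R sketch. The helper `pct_rank` is a stand-in written for this illustration (it assumes no missing values) and mirrors the usual percent-rank definition, (minimum rank - 1) / (n - 1). The `1 -` is kept so that high-ranking rows get small sums.

```r
df <- data.frame(x = seq(10, 20), y = seq(8, 18), z = seq(0, 10))

# Illustrative stand-in for a percent rank; assumes no missing values
pct_rank <- function(v) (rank(v, ties.method = "min") - 1) / (length(v) - 1)

# Pooled: all 33 values are ranked together, then reshaped back to columns
pooled <- rowSums(matrix(1 - pct_rank(unlist(df)), ncol = ncol(df)))

# Per column: each variable is ranked only against its own values
per_col <- rowSums(1 - sapply(df, pct_rank))

round(cbind(pooled, per_col), 3)
```

In the pooled version the z column, whose values are small next to x and y, always contributes a large term, so even the last row does not reach a sum of 0. With per-column ranking the last row is the maximum of every variable and its sum is exactly 0.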

Answer 3: (score: 0)

You can create a function that finds the percentile membership of a column and use it:

df<-data.frame(x=100:900, y=1100: 1900, z=2800:2000) 
tail(df)    
# percentile membership of a column    

getPercentile<- function (datacol) 
{
    as.numeric(cut(datacol, breaks = quantile(datacol, probs = seq(0, 
        1, by = 0.01)), labels = as.character(1:100), include.lowest = TRUE))
}

getPercentile(df$x)

# get column-wise percentile membership of all columns
res<- as.data.frame(apply( df,2,getPercentile ))

colnames(res)

# filter any way you want
# e.g. lowest 2 percentile bins of the first two columns and percentile bin 90 or above of the last
res[res$x<=2 & res$y<=2 & res$z>=90, ]
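
The percentile bins can also drive the relaxing search the question asks for. The following is only a sketch under assumptions: `df2` is a made-up data frame in which all three columns increase together, and `worst_bin` and `top_m` are hypothetical names. `getPercentile` is copied from above so that the block runs on its own.

```r
getPercentile <- function(datacol) {
  as.numeric(cut(datacol,
                 breaks = quantile(datacol, probs = seq(0, 1, by = 0.01)),
                 labels = as.character(1:100), include.lowest = TRUE))
}

# Assumed example data: every column increases with the row number
df2 <- data.frame(x = 100:900, y = 1100:1900, z = 2000:2800)
res <- as.data.frame(lapply(df2, getPercentile))

# Each row's weakest percentile bin across the columns
worst_bin <- Reduce(pmin, res)

# Start at the top bin and relax one bin at a time until m rows qualify
top_m <- function(m) {
  for (cutoff in 100:1) {
    hits <- which(worst_bin >= cutoff)
    if (length(hits) >= m) return(df2[hits, ])
  }
  df2[0, ]  # fewer than m rows exist in total
}

top_m(5)
```

Taking the row-wise minimum means a row only counts as being in the top bins if it is there on every column. Note that `cut` fails when the quantile breaks are not unique, which can happen with small or heavily tied data.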