Imagine we have a data frame like this:
df <- data.frame(x = seq(10, 20), y = seq(8, 18), z = seq(0, 10))
    x  y  z
1  10  8  0
2  11  9  1
3  12 10  2
4  13 11  3
5  14 12  4
6  15 13  5
7  16 14  6
8  17 15  7
9  18 16  8
10 19 17  9
11 20 18 10
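How can we select the cases that fall in the highest percentile on all of x, y and z? I need code that searches for the cases in the top 1% on every variable; if nothing is found, it should relax the criterion to 2%, then 3%, and so on, until it has found the m cases with the highest percentiles across all variables. We need to be able to set m as required.
Answer 0 (score: 1)
I think this should work for you: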
df<-data.frame(x=seq(10,20), y=seq(8,18), z=seq(0,10))
# Define the function: df is the input data frame and cases is the 'm' you are looking for.
# startingperc is the percentile level to start from, and tickrate is the step by which
# the percentile is lowered until at least m cases are found.
myfunc <- function(df, cases, startingperc, tickrate) {
  found <- 0
  while (found < cases) {
    # per-column quantile at the current level
    quants <- apply(df, 2, quantile, probs = startingperc)
    # rows that exceed the quantile in every column
    indices <- which(apply(df, 1, function(x) all(x > quants)))
    found <- length(indices)
    if (found < cases) {
      startingperc <- startingperc - tickrate
    }
  }
  # A large tickrate can overshoot; keep only the 'cases' rows with the largest row sums
  if (length(indices) > cases) {
    indices <- rev(indices[order(apply(df[indices, ], 1, sum), decreasing = TRUE)[1:cases]])
  }
  return(df[indices, ])
}
# in use
myfunc(df, 5, .99, .01)
which gives:
> myfunc(df, 5, .99, .01)
    x  y  z
7  16 14  6
8  17 15  7
9  18 16  8
10 19 17  9
11 20 18 10
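Answer 1 (score: 1)
You can do something like this to see how many observations qualify at each quantile in a sequence. You can modify it to return the indices of those rows instead, and you can change the sequence of quantiles that is iterated over.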
lapply(lapply(seq(0.9,0.1,-0.1), function(xx) Reduce(intersect, lapply(df, function(x) which(x>=quantile(x, probs = xx))))), length)
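For each column, this finds the observations whose value is at or above the quantile, then intersects the results across all columns to get the common indices. The vector of quantiles to iterate over is supplied by seq(), and the final step simply counts the length of each result.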
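As the answer suggests, the same idea can be adapted to return the row indices rather than only the counts. The following is a minimal sketch of that, not the answerer's code: common_rows is a hypothetical helper name, and the data frame is the one from the question.

```r
# Same example data as in the question
df <- data.frame(x = seq(10, 20), y = seq(8, 18), z = seq(0, 10))

# common_rows() is a hypothetical helper name, not part of the original answer.
# For each quantile level in probs, it returns the indices of the rows whose
# value is at or above that column's quantile in every column.
common_rows <- function(dat, probs = seq(0.9, 0.1, by = -0.1)) {
  out <- lapply(probs, function(p) {
    per_col <- lapply(dat, function(col) which(col >= quantile(col, probs = p)))
    Reduce(intersect, per_col)
  })
  names(out) <- paste0("p", probs)
  out
}

idx <- common_rows(df)
lengths(idx)      # number of qualifying rows at each level
df[idx[[1]], ]    # the rows that qualify at the strictest level
```

Because the comparison uses >= against an interpolated quantile, a value sitting exactly on a quantile boundary can land on either side depending on floating-point rounding, so the counts are best treated as approximate at the edges.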
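Answer 2 (score: 1)
Although it is not strictly necessary (you could just find the minimum percentile and use ceiling), this makes a nice example of a recursive function: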
fun <- function(n_rows = 1, pct = 1, dat = df) {
  # This part doesn't need to be repeated. Uses dplyr::percent_rank to calculate
  # percentiles, and sums each row of percentiles. unlist() pools the values of
  # all columns explicitly, so the ranking does not depend on how percent_rank
  # treats a data frame argument.
  row_sums <- rowSums(matrix(1 - dplyr::percent_rank(unlist(dat)),
                             ncol = ncol(dat)))
  fun2 <- function(p = pct) {    # defines a recursive function
    # calculates if each row is below the percentile threshold
    working_rows <- row_sums <= p / 100 * ncol(dat)
    if (sum(working_rows) >= n_rows) {    # if enough rows,
      dat[working_rows, ]                 # returns them
    } else {
      fun2(p + 1)    # else calls itself, incrementing the threshold
    }
  }
  fun2(pct)    # call recursive function with initial percentile
}
fun()
##     x  y  z
## 11 20 18 10
fun(3)
##     x  y  z
## 9  18 16  8
## 10 19 17  9
## 11 20 18 10
fun(n_rows = 1, pct = 50)
##     x  y  z
## 7  16 14  6
## 8  17 15  7
## 9  18 16  8
## 10 19 17  9
## 11 20 18 10
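Note that this ranks all the values of all columns together as a single group. To rank each column separately, replace the row_sums line with the following (keeping the 1 - so that low sums still mean high percentiles):
row_sums <- rowSums(1 - sapply(dat, dplyr::percent_rank))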
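The opening remark of this answer, that you could find the minimum percentile directly and use ceiling, can be sketched without recursion roughly as follows. This is an illustration under stated assumptions rather than the answerer's code: pct_rank and top_rows are hypothetical names, pct_rank is a base R stand-in that follows the usual percent-rank definition of (minimum rank - 1) / (n - 1), each column is ranked separately, and the data is assumed to contain no missing values.

```r
# Same example data as in the question
df <- data.frame(x = seq(10, 20), y = seq(8, 18), z = seq(0, 10))

# pct_rank() and top_rows() are hypothetical names used only for this sketch.
# pct_rank() is a base R stand-in: (minimum rank - 1) / (n - 1).
pct_rank <- function(x) (rank(x, ties.method = "min") - 1) / (length(x) - 1)

top_rows <- function(n_rows = 1, pct = 1, dat = df) {
  stopifnot(n_rows <= nrow(dat))
  # Per-row sum of (1 - percent rank), ranking each column separately
  row_sums <- rowSums(1 - sapply(dat, pct_rank))
  # The n_rows-th smallest sum is the value the threshold has to reach
  kth <- sort(row_sums)[n_rows]
  # Smallest whole percentage whose threshold p / 100 * ncol(dat) covers kth
  p <- max(pct, ceiling(kth * 100 / ncol(dat)))
  # Guard against floating-point rounding right at the boundary
  if (sum(row_sums <= p / 100 * ncol(dat)) < n_rows) p <- p + 1
  dat[row_sums <= p / 100 * ncol(dat), , drop = FALSE]
}

res1 <- top_rows()     # at least one row
res3 <- top_rows(3)    # at least three rows
res3
```

Like the recursive version, this returns every row under the final threshold, so it can return more than n_rows rows when several rows tie or fall within the same percentage step.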
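Answer 3 (score: 0)
You can create a function that finds the percentile membership of each value within its column, and then filter on it: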
df <- data.frame(x = 100:900, y = 1100:1900, z = 2800:2000)
tail(df)
# percentile membership of a column
getPercentile <- function(datacol) {
  as.numeric(cut(datacol,
                 breaks = quantile(datacol, probs = seq(0, 1, by = 0.01)),
                 labels = as.character(1:100), include.lowest = TRUE))
}
getPercentile(df$x)
# get column-wise percentile membership of all columns
res <- as.data.frame(apply(df, 2, getPercentile))
colnames(res)
# filter any way you want
# e.g. percentile bins 1-2 of the first two columns and bins 90-100 of the last
res[res$x <= 2 & res$y <= 2 & res$z >= 90, ]
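To connect this back to the original question, the percentile bins can be used to relax a cutoff step by step until m rows qualify. The sketch below is one possible way to do that and is not part of the answer: top_m is a hypothetical helper, and it assumes the quantile breaks of every column are distinct (cut fails otherwise) and that there are no missing values.

```r
# Same example data as in this answer
df <- data.frame(x = 100:900, y = 1100:1900, z = 2800:2000)

# Percentile bin (1..100) of each value within its column, as in the answer
getPercentile <- function(datacol) {
  as.numeric(cut(datacol,
                 breaks = quantile(datacol, probs = seq(0, 1, by = 0.01)),
                 labels = as.character(1:100), include.lowest = TRUE))
}

# top_m() is a hypothetical helper, not part of the original answer.
# It lowers the cutoff bin from 100 downwards until at least m rows have
# every column in a bin at or above the cutoff.
top_m <- function(dat, m) {
  stopifnot(m <= nrow(dat))
  bins <- sapply(dat, getPercentile)    # matrix of bins, one column per variable
  worst <- apply(bins, 1, min)          # each row's lowest bin across columns
  for (cutoff in 100:1) {
    keep <- which(worst >= cutoff)
    if (length(keep) >= m) break
  }
  list(cutoff = cutoff, rows = dat[keep, , drop = FALSE])
}

res <- top_m(df, 5)
res$cutoff       # the bin at which enough rows were found
head(res$rows)
```

Since z runs in the opposite direction to x and y in this data, no row is near the top on all three columns, so the cutoff has to be relaxed a long way before any rows qualify.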