Question

Suppose I have a data frame like this:

df<-data.frame(f=rep(c("a", "b", "c", "d"), 100), value=rnorm(400))

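I want to create a new column containing the percentile that each observation falls into, computed separately within each level of the factor.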

What is a reasonably simple and efficient way to do this? The closest I have come to a solution is

df$newColumn<-findInterval(df$value, tapply(df$value, df$f, quantile, probs=seq(0, 0.99, 0.01))$df[, "f"])

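However, this just gives zero for every observation. tapply returns a list of four quantile vectors (one per factor level), and I don't know how to access the relevant element for each observation so that it can be passed as an argument to findInterval.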
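For reference, the shape of that tapply result can be inspected directly. The sketch below is only illustrative: the seed and the names breaks, groups, and bins are assumptions, not part of the original attempt. It suggests that each level's break points can be retrieved by level name and paired with that level's own values.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(f = rep(c("a", "b", "c", "d"), 100), value = rnorm(400))

# One vector of 100 break points per factor level, indexed by level name
breaks <- tapply(df$value, df$f, quantile, probs = seq(0, 0.99, 0.01))
names(breaks)
length(breaks[["a"]])

# Split the values by level and look up each level's own break points
groups <- split(df$value, df$f)
bins <- lapply(names(groups), function(lv) {
  findInterval(groups[[lv]], breaks[[lv]])
})
names(bins) <- names(groups)

# Put the per-level results back into the original row order
df$newColumn <- unsplit(bins, df$f)
```

This works, but it is essentially a hand-written version of a grouped operation.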

The data frame may have up to a few million rows, so speed is also a concern. The factor column will always have four levels.

Answer 1

Using dplyr:

library(dplyr)

df %>% 
    group_by(f) %>% 
    mutate(quant = findInterval(value, quantile(value)))

#> Source: local data frame [400 x 3]
#> Groups: f [4]
#> 
#>         f       value quant
#>    <fctr>       <dbl> <int>
#> 1       a  0.51184061     3
#> 2       b  0.44362348     3
#> 3       c -1.04869448     1
#> 4       d -2.41772425     1
#> 5       a  0.10738332     3
#> 6       b -0.58630348     1
#> 7       c  0.34376820     3
#> 8       d  0.68322738     4
#> 9       a  1.00232314     4
#> 10      b  0.05499391     3
#> # ... with 390 more rows
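Note that quantile() with its default probs returns quartile break points, so the quant column above holds quartile bins rather than percentiles. A minimal base R sketch of that behaviour on a toy vector (the vector x and the names q_default and q_closed are made up for illustration):

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Default probs are 0, 0.25, 0.5, 0.75, 1, so these are quartile breaks
breaks <- quantile(x)

# The maximum equals the last break, so it lands in an extra bin 5
q_default <- findInterval(x, breaks)
q_default
# 1 1 2 2 3 3 4 4 5

# Treating the last interval as closed keeps the maximum in bin 4
q_closed <- findInterval(x, breaks, rightmost.closed = TRUE)
q_closed
# 1 1 2 2 3 3 4 4 4
```

For percentile bins, passing probs = seq(0, 1, 0.01) to quantile() inside the same mutate() call should give bins 1 to 100 per group.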

Using data.table:

library(data.table)

dt <- setDT(df)
dt[, quant := findInterval(value, quantile(value)), by = f]
dt
#>      f      value quant
#>   1: a  0.3608395     3
#>   2: b -0.1028948     2
#>   3: c -2.1903336     1
#>   4: d  0.7470262     4
#>   5: a  0.5292031     3
#>  ---                   
#> 396: d -1.3475332     1
#> 397: a  0.1598605     3
#> 398: b -0.4261003     2
#> 399: c  0.3951650     3
#> 400: d -1.4409000     1
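The data.table call above can be adapted to percentile bins in the same way. This is a sketch under assumptions: the column name pct and the seed are arbitrary, and rightmost.closed = TRUE is one possible choice for keeping each group's maximum in bin 100.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility
dt <- data.table(f = rep(c("a", "b", "c", "d"), 100), value = rnorm(400))

# 101 break points per group (0th to 100th percentile) define 100 bins;
# := adds the column by reference, without copying the table
dt[, pct := findInterval(value,
                         quantile(value, probs = seq(0, 1, 0.01)),
                         rightmost.closed = TRUE),
   by = f]

# Quick check of the bin range within each group
dt[, .(min_pct = min(pct), max_pct = max(pct)), by = f]
```

Because the column is added by reference, this form is likely to suit the multi-million-row case from the question, though that is worth benchmarking on the real data.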

Data:

df <- data.frame(f = rep(c("a", "b", "c", "d"), 100), value = rnorm(400))

Answer 2

I think data.table is faster, but a solution that does not use any packages is:

Define a function based on cut (or findInterval) together with quantile:

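cut2 <- function(x) {
  # Break points at the 0th, 1st, ..., 100th percentiles of x;
  # labels = FALSE returns the bin number (1 to 100) as an integer
  cut(x, breaks = quantile(x, probs = seq(0, 1, 0.01)),
      include.lowest = TRUE, labels = FALSE)
}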

Then apply it within each factor level using ave:

df$newColumn <- ave(df$value, df$f, FUN = cut2)

For each observation, this finds the corresponding percentile within the subset determined by the factor.
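Put together, the base R approach might look like the following self-contained sketch. The helper name pct_bin and the seed are assumptions for illustration, and the final comparison is only a sanity check against a manual calculation for one level.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
df <- data.frame(f = rep(c("a", "b", "c", "d"), 100), value = rnorm(400))

# Hypothetical helper: percentile bin (1 to 100) of each element within x.
# cut() raises an error if the break points are not unique, which can
# happen when x contains many tied values.
pct_bin <- function(x) {
  cut(x, breaks = quantile(x, probs = seq(0, 1, 0.01)),
      include.lowest = TRUE, labels = FALSE)
}

# ave() applies pct_bin within each level of f and keeps the row order
df$newColumn <- ave(df$value, df$f, FUN = pct_bin)

# Sanity check: level "a" matches a direct calculation on that subset
a_rows <- df$f == "a"
all(df$newColumn[a_rows] == pct_bin(df$value[a_rows]))
```

ave returns a vector of the same length and order as its input, so the result can be assigned straight back as a column.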
