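Binning a variable and setting the bin sizes

Time: 2017-11-10 14:01:55

Tags: r

I am trying to bin a variable in R, and I want to set the sizes of the bins myself. The variable in the first column would be binned, and the bin numbers would be assigned according to the following proportions: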

bin1 = 0.1
bin2 = 0.4
bin3 = 0.3
bin4 = 0.2

The output would look like this:

var_to_bin  binned_var
1           1
2           2
3           2
4           2
5           2
6           3
7           3
8           3
9           4
10          4

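Does anyone know a way to do this? The binning functions I have found let me set bin ranges in terms of the values of var_to_bin, but I want R to set the bins automatically as quantiles of pre-specified sizes.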

3 Answers:

Answer 0: (score: 1)

You can do this with findInterval, quantile, and cumsum.

dat$newBin <- findInterval(dat$var_to_bin,
                           vec=quantile(dat$var_to_bin, probs=cumsum(myProbs)),
                           rightmost.closed=TRUE) + 1L

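Here, findInterval takes the vector to bin together with a vector of cut points. The cut points are built with quantile, which is given the cumulative sum of the desired bin proportions. The last argument, rightmost.closed = TRUE, makes the final interval closed on the right, so the maximum value (which equals the last cut point) is counted in the last bin instead of falling past it.

This returns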

dat
   var_to_bin binned_var newBin
1           1          1      1
2           2          2      2
3           3          2      2
4           4          2      2
5           5          2      2
6           6          3      3
7           7          3      3
8           8          3      3
9           9          4      4
10         10          4      4

Data

dat <-
structure(list(var_to_bin = 1:10, binned_var = c(1L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L, 4L, 4L)), .Names = c("var_to_bin", "binned_var"
), class = "data.frame", row.names = c(NA, -10L))

myProbs <- c(.1, .4, .3, .2)
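
To see why this works, it can help to look at the intermediate values one step at a time. The following is only a minimal sketch on the same 1:10 example; the names x, bin_props, upper_probs, cuts, and bins are illustrative and not part of the answer above, and quantile is assumed to use its default method (type 7).

```r
# Illustrative stand-alone example; all variable names here are hypothetical.
x <- 1:10
bin_props <- c(0.1, 0.4, 0.3, 0.2)  # desired share of each bin, sums to 1

# Cumulative proportions give the upper quantile level of each bin.
upper_probs <- cumsum(bin_props)    # 0.1 0.5 0.8 1.0

# Upper cut point of each bin (default quantile type 7).
cuts <- quantile(x, probs = upper_probs, names = FALSE)  # 1.9 5.5 8.2 10.0

# findInterval returns, for each value, the number of cut points <= it
# (0 for values below the first cut point). With rightmost.closed = TRUE,
# a value equal to the last cut point stays in the last interval.
# Adding 1 turns the 0-based result into bin numbers starting at 1.
bins <- findInterval(x, vec = cuts, rightmost.closed = TRUE) + 1L
bins
```

Without rightmost.closed = TRUE, the maximum value 10 would be counted as lying past the last cut point and would end up in a fifth bin.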

Answer 1: (score: 0)

You can do this with cut.

var_to_bin = 1:10
as.numeric(cut(var_to_bin, include.lowest=TRUE,
   breaks=quantile(var_to_bin, probs=c(0,0.1,0.5,0.8,1))))
 [1] 1 2 2 2 2 3 3 3 4 4
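
The probs vector above is the cumulative sum of the bin proportions from the question with a leading 0. As a rough sketch, that conversion could be wrapped in a small helper; bin_by_props is a hypothetical name introduced here for illustration, not an existing function.

```r
# Hypothetical helper: bin x into quantile-based bins whose sizes are
# given as proportions (props must sum to 1).
bin_by_props <- function(x, props) {
  stopifnot(isTRUE(all.equal(sum(props), 1)))
  # Quantile levels of the bin edges: 0, then the cumulative proportions.
  probs <- c(0, cumsum(props))
  probs[length(probs)] <- 1  # guard against floating-point drift above 1
  breaks <- quantile(x, probs = probs, names = FALSE)
  # include.lowest = TRUE keeps the minimum of x inside the first bin.
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE))
}

bin_by_props(1:10, c(0.1, 0.4, 0.3, 0.2))
```

Note that cut stops with an error when two break points coincide, which can happen if the data contain many tied values.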

Answer 2: (score: 0)

Check out bin_data() from the mltools package.

# Here x is your var_to_bin.
# We specify the bin endpoints cumulatively as quantiles.
# The result is an ordered factor whose levels represent the unique bins
# and whose values represent which bin each value of x falls into.
# Note that these bins are "left-closed, right-open" by default.

library(mltools)

bin_data(x = 1:10, bins = c(0, 0.1, 0.5, 0.8, 1), binType = "quantile")

 [1] [1, 1.9)   [1.9, 5.5) [1.9, 5.5) [1.9, 5.5) [1.9, 5.5) [5.5, 8.2) [5.5, 8.2) [5.5, 8.2) [8.2, 10]  [8.2, 10] 
Levels: [1, 1.9) < [1.9, 5.5) < [5.5, 8.2) < [8.2, 10]