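Equal-frequency and equal-width binning in R

Time: 2017-02-04 07:27:00

Tags: r

Given a dataset, I want to split it into 4 partitions using equal-frequency binning and equal-width binning, as described here, but I want to do it in R.

Dataset: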

0, 4, 12, 16, 16, 18, 24, 26, 28

I tried writing some code for equal-width binning, but it only produces a histogram.

bins<-4;
minimumVal<-min(dataset)
maximumVal<-max(dataset)
width=(maximumVal-minimumVal)/bins;
edges = minimumVal:width:maximumVal;
hist(dataset, breaks = "Sturges", freq = TRUE, xlim = range(edges))

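I am new to R, so any help with doing these two kinds of binning in R would be appreciated.

2 answers:

Answer 0: (score: 4)

For equal-width binning, I suggest using the classInt package: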

dataset <- c(0, 4, 12, 16, 16, 18, 24, 26, 28)

library(classInt)
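# Note: without a style argument, classIntervals() defaults to style = 'quantile'
classIntervals(dataset, 4)
# Equal-width intervals need style = 'equal'
x <- classIntervals(dataset, 4, style = 'equal')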

To use the break points, you can look at x$brks

For equal-frequency binning, you can use the same package with the option style = 'quantile'
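For comparison, the same equal-width break points can be computed in base R without any package. This is only a minimal sketch on the question's dataset, not what classInt does internally; the names edges and binned are illustrative:

```r
# Equal-width binning in base R (illustrative sketch)
dataset <- c(0, 4, 12, 16, 16, 18, 24, 26, 28)
bins <- 4

# bins + 1 equally spaced edges between the minimum and the maximum
edges <- seq(min(dataset), max(dataset), length.out = bins + 1)

# include.lowest = TRUE keeps the minimum in the first bin
# instead of turning it into NA
binned <- cut(dataset, breaks = edges, include.lowest = TRUE)

# Number of observations per bin
table(binned)
```

With this dataset the edges are 0, 7, 14, 21 and 28, and the bins hold 2, 1, 3 and 3 values respectively.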

classIntervals(dataset, 4, style = 'quantile')

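Note that the dataset cannot be distributed over 4 bins with exactly the same number of elements: it has 9 elements, and the value 16 is repeated. I don't know whether this is a problem, since the reference you provided states: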

  

"...each group contains approximately the same number of values."
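To see how uneven the groups end up with 9 values and 4 bins, the quantile-based split can be reproduced in base R. This is a sketch under the assumption that quartiles from quantile() with its default settings are an acceptable definition of the cut points; it is not necessarily identical to what classInt computes, and the names edges and binned are illustrative:

```r
# Equal-frequency (quantile) binning in base R (illustrative sketch)
dataset <- c(0, 4, 12, 16, 16, 18, 24, 26, 28)
bins <- 4

# Cut points at the 0%, 25%, 50%, 75% and 100% quantiles
edges <- quantile(dataset, probs = seq(0, 1, length.out = bins + 1))

# cut() requires unique breaks; heavy ties could make two quantiles
# coincide, which does not happen with this dataset
binned <- cut(dataset, breaks = edges, include.lowest = TRUE)

# Number of observations per bin
table(binned)
```

Here the cut points are 0, 12, 16, 24 and 28, and the bins hold 3, 2, 2 and 2 values, which is as close to equal as 9 values in 4 groups can get.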

Since you did not state exactly which method you are looking for, I suggest this post as another approach; for your example it would be:

library(Hmisc)
table(cut2(dataset, m = length(dataset)/4))

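In addition, the other posts at the link suggested above offer further alternatives and some related discussion of these methods.

Answer 1: (score: 0)

For equal-width binning you can try the following: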

set.seed(1)
dataset <- runif(100, 0, 10) # some random data
bins<-4
minimumVal<-min(dataset)
maximumVal<-max(dataset)
width=(maximumVal-minimumVal)/bins;
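# Note: cut() builds left-open intervals by default, so the minimum value falls
# outside the first bin and becomes <NA> (element 27 in the output below);
# passing include.lowest = TRUE to cut() would keep it in the first bin.
cut(dataset, breaks=seq(minimumVal, maximumVal, width))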

#[1] (2.58,5.03]  (2.58,5.03]  (5.03,7.47]  (7.47,9.92]  (0.134,2.58] (7.47,9.92]  (7.47,9.92]  (5.03,7.47]  (5.03,7.47]  (0.134,2.58] (0.134,2.58] (0.134,2.58]
#[13] (5.03,7.47]  (2.58,5.03]  (7.47,9.92]  (2.58,5.03]  (5.03,7.47]  (7.47,9.92]  (2.58,5.03]  (7.47,9.92]  (7.47,9.92]  (0.134,2.58] (5.03,7.47]  (0.134,2.58]
#[25] (2.58,5.03]  (2.58,5.03]  <NA>         (2.58,5.03]  (7.47,9.92]  (2.58,5.03]  (2.58,5.03]  (5.03,7.47]  (2.58,5.03]  (0.134,2.58] (7.47,9.92]  (5.03,7.47] 
#[37] (7.47,9.92]  (0.134,2.58] (5.03,7.47]  (2.58,5.03]  (7.47,9.92]  (5.03,7.47]  (7.47,9.92]  (5.03,7.47]  (5.03,7.47]  (7.47,9.92]  (0.134,2.58] (2.58,5.03] 
#[49] (5.03,7.47]  (5.03,7.47]  (2.58,5.03]  (7.47,9.92]  (2.58,5.03]  (0.134,2.58] (0.134,2.58] (0.134,2.58] (2.58,5.03]  (5.03,7.47]  (5.03,7.47]  (2.58,5.03] 
#[61] (7.47,9.92]  (2.58,5.03]  (2.58,5.03]  (2.58,5.03]  (5.03,7.47]  (0.134,2.58] (2.58,5.03]  (7.47,9.92]  (0.134,2.58] (7.47,9.92]  (2.58,5.03]  (7.47,9.92] 
#[73] (2.58,5.03]  (2.58,5.03]  (2.58,5.03]  (7.47,9.92]  (7.47,9.92]  (2.58,5.03]  (7.47,9.92]  (7.47,9.92]  (2.58,5.03]  (5.03,7.47]  (2.58,5.03]  (2.58,5.03] 
#[85] (7.47,9.92]  (0.134,2.58] (5.03,7.47]  (0.134,2.58] (0.134,2.58] (0.134,2.58] (0.134,2.58] (0.134,2.58] (5.03,7.47]  (7.47,9.92]  (7.47,9.92]  (7.47,9.92] 
#[97] (2.58,5.03]  (2.58,5.03]  (7.47,9.92]  (5.03,7.47] 
#Levels: (0.134,2.58] (2.58,5.03] (5.03,7.47] (7.47,9.92]

#plot frequencies in the bins
barplot(table(cut(dataset, breaks=seq(minimumVal, maximumVal, width))))
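The single <NA> in the output above is the minimum value, which the left-open first interval excludes. A possible variant that keeps every observation is sketched below on the same random data. It assumes that building the edges with length.out instead of a step width is acceptable (as far as I know, seq() then returns both endpoints exactly); the names edges, left_open and closed are illustrative:

```r
# Equal-width binning that keeps the minimum value (illustrative sketch)
set.seed(1)
dataset <- runif(100, 0, 10)  # same random data as above
bins <- 4

# bins + 1 equally spaced edges from the minimum to the maximum
edges <- seq(min(dataset), max(dataset), length.out = bins + 1)

# Default behaviour: first interval is left-open, so the minimum is dropped
left_open <- cut(dataset, breaks = edges)

# include.lowest = TRUE closes the first interval on the left
closed <- cut(dataset, breaks = edges, include.lowest = TRUE)

sum(is.na(left_open))  # observations lost with the default
sum(is.na(closed))     # observations lost with include.lowest = TRUE

# Frequencies per bin, ready to pass to barplot()
table(closed)
```

The first level is then labelled with a closed bracket on the left, and the table covers all 100 observations.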
