Grouping observations into bins

Time: 2016-03-08 18:29:07

Tags: r statistics cut binning

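I am trying to assign the observations in a data frame to 36 groups based on two continuous variables. More specifically, I want to cut each of the two variables into six bins and then assign each observation to one of the 36 possible bin combinations.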

My attempt is below, and it works. But is there a faster way that avoids the double loop?

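Also, though it is not essential, how can I visualize the number of observations in each group as a 6-by-6 grid? I know table() produces a list of the 36 possible groups and their counts, but not in grid form.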

set.seed(123)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
data <- data.frame(x1,x2)

labs1 <- levels(cut(x1, 6))
ints1 <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1", labs1)),
               upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labs1)))
labs2 <- levels(cut(x2, 6))
ints2 <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1", labs2)),
               upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labs2)))

tmp <- expand.grid(labs1, labs2)
groups <- cbind(lower1 =  as.numeric(sub("\\((.+),.*", "\\1", tmp[,1])), 
                upper1 = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", tmp[,1])), 
                lower2 = as.numeric(sub("\\((.+),.*", "\\1", tmp[,2])),
                upper2 = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", tmp[,2])))

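data$group <- NA_integer_  # initialize so rows that match no group stay NA
for (i in seq_len(nrow(data))){
  for (j in seq_len(nrow(groups))){
    # && is the scalar logical AND, which is what if() expects
    if (x1[i] >= groups[j,1] && x1[i] <= groups[j,2] &&
        x2[i] >= groups[j,3] && x2[i] <= groups[j,4]){
      data$group[i] <- j
    }
  }
}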

2 answers:

Answer 0: (score: 0)

You can combine apply(), which iterates over the rows of your data.frame, with which(), which checks each row against every row of your groups array:

data$group <- apply(data, 1, FUN=function(dataRow) 
  which(
    dataRow[1] >= groups[,1] & 
    dataRow[1] <= groups[,2] & 
    dataRow[2] >= groups[,3] & 
    dataRow[2] <= groups[,4]))
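One caveat with this pattern: which() returns every matching index, so a row that matches no group (or sits on a shared boundary and matches two) makes apply() return a list rather than a plain vector. The following is a minimal sketch of a guarded variant. The helper names `edges` and `first_match` are hypothetical, and the bounds are rebuilt from equal-width break points as an assumption, rather than parsed from the rounded cut() labels used in the question.

```r
set.seed(123)
data <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))

# Assumed bounds: 6 equal-width bins per variable over its observed range
# (hypothetical helper; not the rounded label bounds from the question)
edges <- function(v, n = 6) seq(min(v), max(v), length.out = n + 1)
e1 <- edges(data$x1)
e2 <- edges(data$x2)
grid <- expand.grid(i = 1:6, j = 1:6)
groups <- cbind(lower1 = e1[grid$i], upper1 = e1[grid$i + 1],
                lower2 = e2[grid$j], upper2 = e2[grid$j + 1])

# Return the first matching group index, or NA when nothing matches,
# so that apply() always receives exactly one value per row
first_match <- function(row) {
  hit <- which(row[1] >= groups[, 1] & row[1] <= groups[, 2] &
               row[2] >= groups[, 3] & row[2] <= groups[, 4])
  if (length(hit) == 0L) NA_integer_ else hit[1]
}

data$group <- apply(as.matrix(data[, c("x1", "x2")]), 1, first_match)
```

Taking the first hit is an arbitrary tie-breaking choice for values that land exactly on a shared bin boundary; the scan over all 36 groups for every row is still quadratic in spirit, just without the explicit loops.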

Answer 1: (score: 0)

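You are overthinking the problem. Getting the 6x6 table is a one-liner with table(). (Use the factor variables created by cut(..., 6) directly; don't throw the factor away and then manually re-apply its levels and tally the variables yourself.)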

with(data, table(cut(x1, 6), cut(x2, 6)))

                 (-3.05,-1.97] (-1.97,-0.902] (-0.902,0.171] (0.171,1.24] (1.24,2.32] (2.32,3.4]
  (-2.82,-1.8]               2             10             11            7           3          0
  (-1.8,-0.793]              1             26             67           49          19          3
  (-0.793,0.216]            12             57            140          146          31          3
  (0.216,1.22]              11             49            109           95          36          6
  (1.22,2.23]                0             10             31           34          15          0
  (2.23,3.25]                0              3              5            6           2          1

# and to get the wide lines, you may need...
options('width'=199)

# or if you want more compact labels to keep it all narrow, use `cut(..., dig.lab)`
with(data, table(cut(x1, 6, dig.lab=2), cut(x2, 6, dig.lab=2)))

               (-3.1,-2] (-2,-0.9] (-0.9,0.17] (0.17,1.2] (1.2,2.3] (2.3,3.4]
  (-2.8,-1.8]          2        10          11          7         3         0
  (-1.8,-0.79]         1        26          67         49        19         3
  (-0.79,0.22]        12        57         140        146        31         3
  (0.22,1.2]          11        49         109         95        36         6
  (1.2,2.2]            0        10          31         34        15         0
  (2.2,3.2]            0         3           5          6         2         1

Admittedly, neither the table() nor the cut() documentation says this directly, and both could use a 2D example like this one. => doc enhancement/bug
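The same factors can also supply the per-observation group number that the question's double loop computes. This is a minimal sketch, assuming the groups are numbered 1 to 36 with the x1 bin varying fastest (the row order of expand.grid(labs1, labs2) in the question); the names `b1`, `b2`, and `counts` are illustrative only.

```r
set.seed(123)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
data <- data.frame(x1, x2)

# Integer bin codes (1..6) for each variable, taken straight from the factors
b1 <- as.integer(cut(data$x1, 6))
b2 <- as.integer(cut(data$x2, 6))

# Combine the two codes into one group id in 1..36; x1's bin varies fastest
# (assumed numbering, chosen to mirror expand.grid(labs1, labs2))
data$group <- (b2 - 1L) * 6L + b1

# The 6 x 6 table of counts, read column by column, lines up with the ids
counts <- table(cut(data$x1, 6), cut(data$x2, 6))
```

Because cut() assigns every value to exactly one half-open interval, this avoids the boundary ambiguity of comparing against rounded label bounds, and it is fully vectorized.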