Building an R data.frame using apply and rbind

Asked: 2011-09-17 11:03:52

Tags: r

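I have an existing data.frame containing some initial values. What I want to do is create another data.frame that has 10 randomly sampled rows for each row in the first data.frame. I'm also trying to do this the R way, so I'd like to avoid iteration.

So far I have managed to apply a function to every row of the table that generates a single value, but I'm not sure how to extend this to generate 10 rows per application and then bind the results back together.

Here is my progress so far:

Example data: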

   starts <- structure(list(instance = structure(21:26, .Label = c("big_1", 
   "big_10", "big_11", "big_12", "big_13", "big_14", "big_15", "big_16", 
   "big_17", "big_18", "big_19", "big_2", "big_20", "big_3", "big_4", 
   "big_5", "big_6", "big_7", "big_8", "big_9", "competition01", 
   "competition02", "competition03", "competition04", "competition05", 
   "competition06", "competition07", "competition08", "competition09", 
   "competition10", "competition11", "competition12", "competition13", 
   "competition14", "competition15", "competition16", "competition17", 
   "competition18", "competition19", "competition20", "med_1", "med_10", 
   "med_11", "med_12", "med_13", "med_14", "med_15", "med_16", "med_17", 
   "med_18", "med_19", "med_2", "med_20", "med_3", "med_4", "med_5", 
   "med_6", "med_7", "med_8", "med_9", "small_1", "small_10", "small_11", 
   "small_12", "small_13", "small_14", "small_15", "small_16", "small_17", 
   "small_18", "small_19", "small_2", "small_20", "small_3", "small_4", 
   "small_5", "small_6", "small_7", "small_8", "small_9"), class = "factor"), 
   event.clashes = c(674L, 626L, 604L, 1036L, 991L, 929L), overlaps = c(0L, 
   0L, 0L, 0L, 0L, 0L), room.valid = c(324L, 320L, 268L, 299L, 
   294L, 220L), final.timeslot = c(0L, 0L, 0L, 0L, 0L, 0L), 
   three.in.a.row = c(246L, 253L, 259L, 389L, 365L, 430L), single.event = c(97L, 
   120L, 97L, 191L, 150L, 138L)), .Names = c("instance", "event.clashes", 
   "overlaps", "room.valid", "final.timeslot", "three.in.a.row", 
   "single.event"), row.names = c(NA, 6L), class = "data.frame")

Code:

   library(reshape)
   m.starts <- melt(starts)

   df <- data.frame()

   gen.data <- function(x){
       inst <- x[1]
       constr <- x[2]
       v <- as.integer(x[3])
       val <- as.integer(rnorm(1, max(0, v), v / 2))
       # Should probably return a data.frame here
       print(paste(inst, constr, val))
   }

   apply(m.starts, 1, gen.data)

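3 Answers:

Answer 0 (score: 6)

It's not entirely clear to me what you are trying to do, but the following change to your gen.data function seems to do what you want. Specifically, I'm unclear what you are doing with val, since it appears to just generate a random number whose mean is that row's value column and whose standard deviation is that value divided by 2. Is that what you want? I added a new argument to your function to control the number of rows you want to generate: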

gen.data <- function(x, nreps = 10){
    # apply() passes each row as a character vector
    inst <- x[1]
    constr <- x[2]
    v <- as.integer(x[3])
    # Draw nreps values with mean max(0, v) and sd v / 2
    val <- as.integer(rnorm(nreps, max(0, v), v / 2))

    out <- data.frame(inst = rep(inst, nreps),
                      constr = rep(constr, nreps),
                      val = val)

    return(out)
}

Then use:

do.call("rbind", apply(m.starts, 1, gen.data))

Result:

             inst         constr  val
1   competition01  event.clashes  876
2   competition01  event.clashes  714
3   competition01  event.clashes  912
4   competition01  event.clashes  -46
5   competition01  event.clashes  369
....
....
357 competition06   single.event  149
358 competition06   single.event  248
359 competition06   single.event  128
360 competition06   single.event  168
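
One detail worth knowing about the approach above: apply() converts the data.frame to a character matrix before calling the function, which is why the value has to be turned back into a number with as.integer. A minimal sketch of an alternative that loops over row indices with lapply and so keeps the original column types. The names toy and gen.rows are hypothetical, and the toy data only stands in for m.starts:

```r
# Toy stand-in for the melted data (hypothetical values, same column layout)
toy <- data.frame(instance = c("a", "b"),
                  variable = c("x", "y"),
                  value    = c(100L, 0L),
                  stringsAsFactors = FALSE)

# Hypothetical helper: build nreps sampled rows for row i of df
gen.rows <- function(i, df, nreps = 10) {
    v <- df$value[i]  # still an integer, no character round trip
    data.frame(inst   = rep(df$instance[i], nreps),
               constr = rep(df$variable[i], nreps),
               val    = as.integer(rnorm(nreps, mean = max(0, v), sd = v / 2)))
}

set.seed(1)  # only to make the sketch reproducible
res <- do.call(rbind, lapply(seq_len(nrow(toy)), gen.rows, df = toy))
```

Here res has 10 rows per input row, just like the do.call("rbind", apply(...)) version, but gen.rows reads the columns by name instead of by position.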

Answer 1 (score: 1)

There is no need for apply or rbind. All you need is simple vector subsetting:

samples <- sample(1:nrow(starts), nrow(starts)*10, replace=TRUE)
starts[samples, 1:3]

The first 5 rows of the result:

> head(starts[samples, 1:3], 5)

         instance event.clashes overlaps
2   competition02           626        0
5   competition05           991        0
6   competition06           929        0
4   competition04          1036        0
2.1 competition02           626        0
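
Note that with replace = TRUE the rows are drawn at random, so each original row appears about 10 times on average rather than exactly 10 times. A small sketch for checking this, assuming a hypothetical n that stands in for nrow(starts):

```r
set.seed(123)  # only for reproducibility of the sketch
n <- 6         # hypothetical; stands in for nrow(starts)

# Random draw of row indices, as in the answer above
samples <- sample(seq_len(n), n * 10, replace = TRUE)

# How often each row index was drawn (counts vary around 10)
counts <- table(factor(samples, levels = seq_len(n)))

# If exactly 10 copies of every row are wanted, build the index with rep()
exact <- rep(seq_len(n), each = 10)
```

Indexing with exact instead of samples gives a deterministic expansion of the data frame, 10 consecutive copies of each row.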

Answer 2 (score: 0)

You can combine the ideas from Andrie's and Chase's solutions as follows:

# Repeat each row ten times
# (m.starts is the melted data from the question)
start.m1 <- m.starts[rep(1:nrow(m.starts), each = 10), ]

# Create extended vector used to define the means/sds
m <- rep(m.starts$value, each = 10)

# Clamp negative values to zero;
# although none were in your data
m[m <= 0] <- 0

# Replace value with rnorm values
start.m1$value <- rnorm(nrow(start.m1), mean = m, sd = m / 2)

This produces something like the following:

> head(start.m1)
         instance      variable     value
1   competition01 event.clashes 1098.0220
1.1 competition01 event.clashes 1208.4304
1.2 competition01 event.clashes  883.7976
1.3 competition01 event.clashes  365.1396
1.4 competition01 event.clashes  862.3113
1.5 competition01 event.clashes 1352.7085

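I'm using Andrie's suggestion of subset indexing to expand the data frame, together with Chase's interpretation of your question, in which you seem to want to actually generate values via rnorm rather than resample the original rows themselves. The key here is that rnorm is vectorized.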
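
To make the vectorization point concrete, here is a minimal sketch with made-up means (the vector m below is hypothetical, not taken from the data). rnorm pairs up the mean and sd vectors element by element with the draws, so draw i uses mean[i] and sd[i]:

```r
# Hypothetical means: three draws around 10, then three around 1000
m <- rep(c(10, 1000), each = 3)

# With sd = 0 every draw collapses to its own mean,
# which shows that mean[i] is matched to draw i
draws <- rnorm(length(m), mean = m, sd = 0)

set.seed(42)  # only for reproducibility of the sketch
# With sd = m / 2 each draw gets its own mean and its own spread
noisy <- rnorm(length(m), mean = m, sd = m / 2)
```

This is why a single rnorm call can replace a per-row loop: one call fills the whole value column, with each row's own mean and standard deviation.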