Question

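I am generating sample data to run a simulation, and I need to control the variance in the samples. I have written code, but I am not getting the variance I expect, and I need some help with how to do this. Any suggestions for optimizing the code are also welcome!

First, I generate the sample data with the following code: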

library("data.table")
set.seed(1200)
N_Blocks = 100 # My actual data has around 1500 blocks, which makes the for loop below slow, so I restricted this to 100
cyc=200

# formatC(1, ...) pads to nchar(cyc) = 3 digits, so every row starts as "City001"
City <- sample(paste("City", formatC(1, width=nchar(cyc), flag="0"), sep=""),N_Blocks,rep=T)
selected <- sample(0:1,N_Blocks,rep = T)
Census <- sample(0:200,N_Blocks,rep = T)

df1 <- data.frame(City,selected,Census)
str(df1)

Now I need to repeat this data for 60 months (5 years) and for 200 sets, with the month-to-month variance as follows:

City001 - City050: ±5% variance

City051 - City100: ±10% variance

City101 - City150: ±15% variance

City151 - City200: ±20% variance
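
The tiers above can be written as a small lookup, which avoids hand-editing the bounds at each boundary. This is only a sketch: the helper name `variance_for_set` is hypothetical and not part of the original code, and it assumes the set index runs from 1 to 200 as in the list.

```r
# Hypothetical helper: map a set index (1..200) to its +/- variance bound.
variance_for_set <- function(itr) {
  # findInterval() returns 0 for itr < 51, 1 for 51..100, 2 for 101..150, 3 for >= 151
  tier <- findInterval(itr, c(51, 101, 151)) + 1L
  c(0.05, 0.10, 0.15, 0.20)[tier]
}

# Check the tier boundaries
variance_for_set(c(1, 50, 51, 100, 101, 150, 151, 200))
```

With this helper, `varlow` and `varhigh` for a given set would be `1 - variance_for_set(itr)` and `1 + variance_for_set(itr)`.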

My database is large and I would like to use data.table, but since I could not get that to work, I wrote a for loop as shown below:

df1  <- as.data.table(df1, row.names = NULL)

datalist <- list()

varlow <- 0.95
varhigh <- 1.05
sets=1
cyc=200
mov1 =13
M=72
seedno=1200

for (itr in 1:cyc){
  vec0 <- NULL
  vec0 <- as.vector(df1$Census)
  df1a <- df1

  set.seed(seedno)  ## seed for reproducibility (reset at the start of every set)
  for (m in mov1:M) {
    #set.seed(seedno)  ## seed for reproducibility
    for (l in 1:N_Blocks)  {

      vec0[l] <- ifelse(vec0[l]==0 , sample(0:3, 1, rep=T), 
                        sample(floor(vec0[l]*runif(1,varlow,1)):ceiling(vec0[l]*runif(1,1,varhigh)),1,rep=T))

    }

    df1a <- cbind(df1a, data.table(xx=vec0))
    names(df1a)[names(df1a)=="xx"]  <- paste0("M",m)
    df1a$varlow <- varlow
    df1a$varhigh <- varhigh
    df1a$set <- sets
    df1a$City <- sample(paste("City", formatC(itr, width=nchar(cyc), flag="0"), sep=""),N_Blocks,rep=T)


  }

  datalist[[itr]] <- df1a

  if(itr==50){
    varlow=0.90
    varhigh=1.10
    sets=2
  } 

  if(itr==100){
    varlow=0.85
    varhigh=1.15
    sets=3
  }

  if(itr==150){
    varlow=0.80
    varhigh=1.20
    sets=4
  }
}

df1_f <- NULL
df1_f = do.call(rbind, datalist)
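
On the optimization side, the innermost loop over `N_Blocks` could be replaced by one vectorized step per month. The following is a minimal sketch, assuming the same update rule as the loop above (a uniform integer between `floor(v * a)` and `ceiling(v * b)`, with zeros replaced by a draw from `0:3`). The function name `step_month` is hypothetical, and the random stream will not match the loop version draw for draw.

```r
# Hypothetical vectorized version of one monthly update for all blocks at once.
step_month <- function(v, varlow, varhigh) {
  n  <- length(v)
  lo <- floor(v * runif(n, varlow, 1))      # lower integer bound per block
  hi <- ceiling(v * runif(n, 1, varhigh))   # upper integer bound per block
  # Uniform integer in lo..hi: runif() lies strictly inside (0, 1)
  out <- lo + floor(runif(n) * (hi - lo + 1))
  # Blocks that are currently zero get a fresh draw from 0:3, as in the loop
  zero <- v == 0
  out[zero] <- sample(0:3, sum(zero), replace = TRUE)
  out
}

set.seed(1200)
step_month(c(0, 100, 200), varlow = 0.95, varhigh = 1.05)
```

Calling `vec0 <- step_month(vec0, varlow, varhigh)` once per month would take the place of the `for (l in 1:N_Blocks)` loop.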

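This code generates the data: 200 sets of the same 100 records. However, the month-to-month variance is not ±5%, ±10%, ±15% and ±20%.

If I check the growth of each set with the code below, I see that the growth is not what I expected, i.e. the variance does not increase: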

report1 <- df1_f[,.(M24=sum(M24),
                    M36=sum(M36),
                    M48=sum(M48),
                    M60=sum(M60),
                    M72=sum(M72)),by=set]

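The growth rate ranges from -2.1% to 1.8%, even though the variance has been set as high as 20%.

Note: the values in df1$Census need to vary by ±5% and so on. I store these values in vec0 and use them in the for loop.

I think I am missing something basic. How can I get the required sample data for each set?

Thanks!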
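
One possible reading of these numbers, offered as an assumption rather than a diagnosis: a multiplier that is symmetric around 1 has a mean of about 1, so the sum over many blocks can stay nearly flat even while individual rows drift widely. The sketch below illustrates this with a simplified continuous multiplier instead of the integer sampling used in the loop. All sizes and names in it are hypothetical.

```r
set.seed(1)                 # arbitrary seed for this illustration
n_rows   <- 5000            # hypothetical number of blocks
n_months <- 12              # hypothetical number of monthly steps
start    <- rep(100, n_rows)
value    <- start

for (m in seq_len(n_months)) {
  # Symmetric +/-20% multiplier, drawn independently per row and month
  value <- value * runif(n_rows, 0.80, 1.20)
}

total_growth <- sum(value) / sum(start) - 1   # growth of the aggregate
row_spread   <- sd(value / start)             # dispersion of per-row ratios
c(total_growth = total_growth, row_spread = row_spread)
```

If this reading applies, a per-row measure such as the spread of `M72 / Census` would show the variance tiers more clearly than the per-set sums do.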

Code optimization + generating sample data based on a predefined variance

0 answers