Question

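I am trying to simulate a monthly panel of data in R in which one variable depends on lagged values of that same variable. My solution is extremely slow. I need about 1000 samples of 2545 individuals, each observed monthly over many years, but the first sample took my computer 8.5 hours to construct. How can I make this faster?

I start by creating an unbalanced panel of people with different birth dates, monthly ages, and the variables xbsmall and error, which will be compared to determine Outcome. All of the code in the first block is just data setup.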

# Setup:
library(plyr)

# Would like to have 2545 people (nPerson). 
#Instead use 4 for testing.
nPerson = 4
# Minimum and maximum possible ages and birth dates
AgeMin = 10
AgeMax = 50
BornMin = 1950
BornMax = 1963

# Person-specific characteristics
ind = 
  data.frame(
    id = 1:nPerson,
    BornYear = floor(runif(length(1:nPerson), min=BornMin, max=BornMax+1)),
    BornMonth = ceiling(runif(length(1:nPerson), min=0, max=12))
  )

# Make an unbalanced panel of people over age 10 up to year 1986
# panel = ddply(ind, ~id, transform, AgeMonths = BornMonth)
panel = ddply(ind, ~id, transform, AgeMonths = (AgeMin*12):((1986-BornYear)*12 + 12-BornMonth))

# Set up some random variables to approximate the data generating process
panel$xbsmall = rnorm(dim(panel)[1], mean=-.3, sd=.45)
# Standard normal error for probit
panel$error = rnorm(dim(panel)[1])

# Placeholders
panel$xb = rep(0, dim(panel)[1])
panel$Outcome = rep(0, dim(panel)[1])

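Now that we have data, here is the slow part (it takes only about a second on my computer with just 4 people, but becomes very slow with thousands). Each month, a person gets two draws from two different normal distributions, xbsmall and error (these were drawn above), and Outcome == 1 if xbsmall > -error. However, if Outcome equaled 1 in the previous month, then Outcome in the current month is 1 if xbsmall + 4.47 > -error. In the code below I use xb = xbsmall + 4.47 when Outcome was 1 last month (xb is the "linear predictor" in a probit model). For simplicity, I ignore the first month for each person. For your information, this simulates a probit DGP (though you do not need to know that to solve the speed problem).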
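For reference, the process described above can be written compactly. This is only a restatement in the sign convention of the code that follows (xb > -error); the symbols y, x, and ε are notation introduced here for Outcome, xbsmall, and error, with i indexing people and t indexing monthly age:

$$
y_{it} = \mathbf{1}\{\, x_{it} + 4.47\, y_{i,t-1} + \varepsilon_{it} > 0 \,\}, \qquad x_{it} \sim N(-0.3,\ 0.45^2), \quad \varepsilon_{it} \sim N(0,\ 1)
$$

with y set to 0 in each person's first observed month.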

# Outcome == 1 if and only if xb > -error
# The hard part: xb includes information about the previous month's outcome
start_time = Sys.time()
for(i in 1:nPerson){
  # Determine the range of monthly ages to loop over for this person
  AgeMonthMin = min(panel$AgeMonths[panel$id==i], na.rm=T)
  AgeMonthMax = max(panel$AgeMonths[panel$id==i], na.rm=T)
  # Loop over the monthly ages for this person and determine the outcome
  for(t in (AgeMonthMin+1):AgeMonthMax){
    # Indicator for whether Outcome was 1 last month
    panel$Outcome1LastMonth[panel$id==i & panel$AgeMonths==t] = panel$Outcome[panel$id==i & panel$AgeMonths==t-1] 
    # xb = xbsmall + 4.47 if Outcome was 1 last month
    # Otherwise, xb = xbsmall
    panel$xb[panel$id==i & panel$AgeMonths==t] = with(panel[panel$id==i & panel$AgeMonths==t,], xbsmall + 4.47*Outcome1LastMonth)
    # Outcome == 1 if xb > -error
    panel$Outcome[panel$id==i & panel$AgeMonths==t] =
      ifelse(panel$xb[panel$id==i & panel$AgeMonths==t] > - panel$error[panel$id==i & panel$AgeMonths==t], 1, 0)
    }
  }
end_time = Sys.time()
end_time - start_time

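Ideas I have considered for reducing the computation time:

Something using cumsum()
Some wonderful panel data function that I do not know about
Find a way to make the t loop run over the same start and end points for each person, and then somehow use plyr::ddply() or dplyr::group_by()
An iterative solution: make an educated guess for the Outcome values at each monthly age (for example, the mode), then somehow adjust the values that are inconsistent with the previous month. This would work better in my actual application, because xbsmall has a very clear age trend.
Simulate only for a smaller sample, then estimate the effect of sample size on the values I need (the distribution of the regression coefficient estimates, not calculated here)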

Answer 1

One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put its contents in a function:

generate_outcome <- function(x) {
  AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
  AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
  for (i in 2:(AgeMonthMax - AgeMonthMin + 1)){
    x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
    x$Outcome[i] <- ifelse(x$xb[i] > - x$error[i], 1, 0)
  }
  x
}

where x is the data frame for one person. This lets us get rid of the panel$id==i & panel$AgeMonths==t constructs. Now we can just do

out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)

and all.equal(panel$Outcome, out$Outcome) returns TRUE. With this approach, computing 100 people takes 1.8 seconds, compared to 1.5 minutes with the original code.
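As a possible further step in the same split-apply-combine spirit, the recursion can be run on plain numeric vectors instead of on a per-person data frame, since assigning into data frame columns inside a loop tends to be costly in R. The following is a minimal sketch, not a benchmarked solution: simulate_person, its shift argument, and the small made-up panel are hypothetical names and data introduced here for illustration, and the rows are assumed to be ordered by month within each id.

```r
# Hypothetical helper: simulate one person's outcomes from plain vectors.
# The first month stays 0, matching the placeholder used in the question.
simulate_person <- function(xbsmall, error, shift = 4.47) {
  n <- length(xbsmall)
  outcome <- numeric(n)
  if (n >= 2) {
    for (t in 2:n) {
      # Linear predictor: add the shift only if last month's outcome was 1
      xb <- xbsmall[t] + shift * outcome[t - 1]
      outcome[t] <- as.numeric(xb > -error[t])
    }
  }
  outcome
}

set.seed(1)
# Small illustrative unbalanced panel (made-up sizes), ordered by id, then month
n_months <- c(5, 8, 3)
panel <- data.frame(
  id = rep(seq_along(n_months), times = n_months),
  AgeMonths = unlist(lapply(n_months, function(n) 120 + seq_len(n) - 1))
)
panel$xbsmall <- rnorm(nrow(panel), mean = -0.3, sd = 0.45)
panel$error <- rnorm(nrow(panel))

# Split the two input vectors by person, simulate, then restore row order
sim <- Map(simulate_person,
           split(panel$xbsmall, panel$id),
           split(panel$error, panel$id))
panel$Outcome <- unsplit(sim, panel$id)
```

The per-month loop is still there because each month depends on the previous one, but it now touches only atomic vectors. Whether this is actually faster for 2545 people would need to be timed on the real data, for example with system.time().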

R: Quickly simulate an unbalanced panel with a variable that depends on lagged values of itself
