Difference-in-differences estimation with resampling from a large dataset

Date: 2018-12-27 01:20:23

Tags: r large-data difference estimation

I have a large dataset on which I run a difference-in-differences estimation. Given the size of the dataset, the standard errors in the denominators of my t-statistics shrink, the t-statistics are inflated, and the coefficients come out (surreptitiously) statistically significant. I would like to progressively reduce the number of observations in the dataset, resample many times at each step, and re-estimate the interaction coefficient and its standard error on each draw.

I then want to take all the averaged coefficient estimates and standard errors and plot them on a chart, to show at what point (if any) the coefficient becomes statistically indistinguishable from zero.

A toy example with my code follows.

  • I am not sure this is the most efficient way to tackle the problem.
  • I am unable to retrieve and plot the confidence intervals (a plotting sketch follows the toy example below).
  • Given that different groups are present, I am not sure the sampling is representative (see the stratified-sampling sketch right after this list).
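
One way to address the representativeness concern is to draw each subsample within groups rather than from the pooled rows. A minimal sketch (not part of the original code), assuming the stratum variable is country as in the Panel101 toy data below:

# hypothetical helper: draw a roughly equal share of rows from every stratum
stratified_sample <- function(data, n_total, group_var = "country") {
  groups <- split(data, data[[group_var]])
  n_per_group <- max(1, floor(n_total / length(groups)))
  sampled <- lapply(groups, function(g) {
    # never ask for more rows than the stratum contains
    g[sample(nrow(g), min(n_per_group, nrow(g))), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

# drop-in replacement for the random draw inside the inner loop below:
# mydata_temp <- stratified_sample(mydata, index)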

Toy example (credits: Torres-Reyna, 2015)

library(foreign)
library(dplyr)
library(ggplot2)

# Panel101 toy data and DID variables (same setup as in the answer below)
mydata <- read.dta("http://dss.princeton.edu/training/Panel101.dta")
mydata$time <- ifelse(mydata$year >= 1994, 1, 0)
mydata$treated <- ifelse(mydata$country %in% c("E", "F", "G"), 1, 0)
mydata$did <- mydata$time * mydata$treated

df_0 <- NULL
# subsample sizes: 5, 10, 15, ...
for (i in 1:length(seq(5, nrow(mydata) - 1, 5))) {
  index <- seq(5, nrow(mydata), 5)[i]
  df_1 <- NULL
  # resample 10 times at each subsample size
  for (j in 1:10) {

    mydata_temp <- mydata[sample(nrow(mydata), index), ]

    didreg <- lm(y ~ treated + time + did, data = mydata_temp)
    out <- summary(didreg)
    # keep the DID (interaction) coefficient, its standard error, and the subsample size
    new_line <- c(out$coefficients[, 1][4], out$coefficients[, 2][4], index)
    new_line <- data.frame(t(new_line))
    names(new_line) <- c("c", "s", "i")
    df_1 <- rbind(df_1, new_line)
  }
  df_0 <- rbind(df_0, df_1)
}

# average the coefficient and standard error over the resamples at each size
df_0 <- df_0 %>%
  group_by(i) %>%
  summarise(c = mean(c, na.rm = TRUE),
            s = mean(s, na.rm = TRUE))

View(df_0)
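
To get the confidence bands the second bullet asks for, one option is a normal-approximation interval around the averaged coefficient. A minimal ggplot2 sketch, assuming df_0 holds the columns i (subsample size), c (mean coefficient), and s (mean standard error) produced above:

ggplot(df_0, aes(x = i, y = c)) +
  # approximate 95% band: mean coefficient +/- 1.96 * mean standard error
  geom_ribbon(aes(ymin = c - 1.96 * s, ymax = c + 1.96 * s), fill = "grey75") +
  geom_line() +
  # where the band covers zero, the coefficient is indistinguishable from zero
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Subsample size", y = "Mean DID coefficient")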

2 Answers:

Answer 0 (score: 0)

In the end, I solved it this way. Is it the most efficient approach?


Answer 1 (score: 0)

Consider the following refactored code using base R functions: within, %in%, nested lapply, setNames, aggregate, and do.call. This approach avoids calling rbind inside a loop and rewrites the code compactly without constant use of $ column referencing.

library(foreign)

mydata = read.dta("http://dss.princeton.edu/training/Panel101.dta")

mydata <- within(mydata, {
  time <- ifelse(year >= 1994, 1, 0)
  treated <- ifelse(country %in% c("E", "F", "G"), 1, 0)
  did <- time * treated
})

# OUTER LIST OF DATA FRAMES
df_0_list <- lapply(1:length(seq(5,nrow(mydata)-1,5)), function(i) {      
  index <- seq(5,nrow(mydata),5)[i]

  # INNER LIST OF DATA FRAMES  
  df_1_list <- lapply(1:100, function(j) {        
    mydata_temp <- mydata[sample(nrow(mydata), index), ]    

    didreg <- lm(y ~ treated + time + did, data = mydata_temp)
    out <- summary(didreg)
    new_line <- c(out$coefficients[,1][4], out$coefficients[,2][4], index)
    new_line <- setNames(data.frame(t(new_line)), c("c","s","i"))
  })

  # APPEND ALL INNER DFS
  df <- do.call(rbind, df_1_list)
  return(df)
})

# APPEND ALL OUTER DFS
df_0 <- do.call(rbind, df_0_list)

# AGGREGATE WITH NEW COLUMNS
df_0 <- within(aggregate(cbind(c, s) ~ i, df_0, function(x) mean(x, na.rm=TRUE)), { 
               upper = c + s 
               lower = c - s 
        })

# RUN PLOT
within(df_0, {
  plot(i, c, ylim=c(min(c)-5000000000, max(c)+5000000000), type = "l",
       cex.lab=0.75, cex.axis=0.75, cex.main=0.75, cex.sub=0.75)
  polygon(c(i, rev(i)), c(lower, rev(upper)),
          col = "grey75", border = FALSE)
  lines(i, c, lwd = 2)
})

[Plot output]