How do I simulate data properly?

Time: 2014-11-26 14:34:30

Tags: r

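Hi, I'm new to R and have a more general question: how do I simulate or create an example dataset that is suitable for posting here and is also reproducible? For instance, I want to create a numerical example that properly abstracts my own dataset. One requirement is some correlation between my dependent variable and my independent variables. For example, how can I introduce some correlation between my count and my in.var1 and in.var2?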

set.seed(1122)  
count<-rpois(1000,30)  
in.var1<- rnorm(1000, mean = 25, sd = 3)
in.var1<- rnorm(1000, mean = 12, sd = 2)
data<-cbind(count,in.var1,in.var2)

3 answers:

Answer 0 (score: 3)

You can introduce dependence by adding some portion of the "information" in the two variables to the construction of the count variable:

set.seed(1222)
in.var1 <- rnorm(1000, mean = 25, sd = 3)
# Corrected spelling of in.var2
in.var2 <- rnorm(1000, mean = 12, sd = 2)
count <- rpois(1000, 30) + 0.15 * in.var1 + 0.3 * in.var2
# Avoid using `data` as an object name
dat <- data.frame(count, in.var1, in.var2)

> spearman(count, in.var1)
       rho 
0.06859676 
> spearman(count, in.var2)
      rho 
0.1276568 
> spearman(in.var1, in.var2)
        rho 
-0.02175273 

> summary( glm(count ~ in.var1 + in.var2, data=dat) )

Call:
glm(formula = count ~ in.var1 + in.var2, data = dat)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-16.6816   -3.6910   -0.4238    3.4435   15.5326  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.05034    1.74084  16.688  < 2e-16 ***
in.var1      0.14701    0.05613   2.619  0.00895 ** 
in.var2      0.35512    0.08228   4.316 1.74e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
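
How strong the induced correlation is depends on the size of the weights relative to the Poisson noise. The sketch below is an illustration and not part of the original answer. It uses base R's cor() with method = "spearman" (the spearman() call above is not in base R and appears to come from the Hmisc package). The larger weights 2 and 3 and the names noise, count.weak and count.strong are assumptions chosen only for this example.

```r
set.seed(1222)
n <- 1000
in.var1 <- rnorm(n, mean = 25, sd = 3)
in.var2 <- rnorm(n, mean = 12, sd = 2)
noise <- rpois(n, 30)  # Poisson part shared by both versions

# Weights from the answer above: weak dependence
count.weak <- noise + 0.15 * in.var1 + 0.3 * in.var2
# Hypothetical larger weights: stronger dependence
count.strong <- noise + 2 * in.var1 + 3 * in.var2

# Rank correlation of each version with in.var2
rho.weak <- cor(count.weak, in.var2, method = "spearman")
rho.strong <- cor(count.strong, in.var2, method = "spearman")
print(c(weak = rho.weak, strong = rho.strong))
```

Because non-integer terms are added to the Poisson draw, the resulting count is no longer a whole number; wrap it in round() if integer counts are required.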

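Answer 1 (score: 1)

If you want count to be a function of in.var1 and in.var2, try this. Note that count is already a function name (for example in the plyr and dplyr packages), so I changed it to Count.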

set.seed(1122)
in.var1<- rnorm(1000, mean = 4, sd = 3)
in.var2<- rnorm(1000, mean = 6, sd = 2)
Count<-rpois(1000, exp(3+ 0.5*in.var1 - 0.25*in.var2))
Data<-data.frame(Count=Count, Var1=in.var1, Var2=in.var2)

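You now have Poisson counts based on in.var1 and in.var2. A Poisson regression will recover an intercept of about 3, a coefficient of about 0.5 for Var1, and a coefficient of about -0.25 for Var2.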

 summary(glm(Count~Var1+Var2,data=Data, family=poisson))

Call:
glm(formula = Count ~ Var1 + Var2, family = poisson, data = Data)

 Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
   -2.84702  -0.76292  -0.04463   0.67525   2.79537  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.001390   0.011782   254.7   <2e-16 ***
Var1         0.499789   0.001004   498.0   <2e-16 ***
Var2        -0.250949   0.001443  -173.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 308190.7  on 999  degrees of freedom
Residual deviance:   1063.3  on 997  degrees of freedom
AIC: 6319.2

Number of Fisher Scoring iterations: 4
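
To compare the estimates with the values used in the simulation directly, something like the following sketch could be used. It repeats the simulation from this answer; the names fit, true.coef and comparison are introduced here for illustration, and confint.default() is used to obtain Wald-type 95% intervals.

```r
set.seed(1122)
in.var1 <- rnorm(1000, mean = 4, sd = 3)
in.var2 <- rnorm(1000, mean = 6, sd = 2)
Count <- rpois(1000, exp(3 + 0.5 * in.var1 - 0.25 * in.var2))
Data <- data.frame(Count = Count, Var1 = in.var1, Var2 = in.var2)

fit <- glm(Count ~ Var1 + Var2, data = Data, family = poisson)

# Values used to generate the data, on the log (link) scale
true.coef <- c("(Intercept)" = 3, Var1 = 0.5, Var2 = -0.25)

# Side-by-side table: true value, estimate, Wald 95% interval
comparison <- cbind(true = true.coef,
                    estimate = coef(fit),
                    confint.default(fit))
print(round(comparison, 4))
```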

Answer 2 (score: 0)

As I understand it, you want to add some pattern to your data.

# Basic info taken from Data Science Exploratory Analysis Course
# http://datasciencespecialization.github.io/courses/04_ExploratoryAnalysis/

set.seed(1122)  

rowNumber = 1000

count<-rpois(rowNumber,30)  
in.var1<- rnorm(rowNumber, mean = 25, sd = 3)
in.var2<- rnorm(rowNumber, mean = 12, sd = 2)
data<-cbind(count,in.var1,in.var2)


dataNew <- data



for (i in 1:rowNumber) {
  # flip a coin
  coinFlip <- rbinom(1, size = 1, prob = 0.5)
  # if coin is heads add a common pattern to that row
  if (coinFlip) {
    dataNew[i,"count"] <- 2 * data[i,"in.var1"] + 10*   data[i,"in.var2"]
  }
}

Basically, I add the pattern count = 2 * in.var1 + 10 * in.var2 to some random rows, selected here by the coinFlip variable. Of course, for more rows you should vectorize it.
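
A vectorized version of the loop might look like the sketch below. It is one possible rewrite, not the answerer's own code: all coin flips are drawn in a single rbinom() call and used as a logical row index. The name heads is introduced here for illustration, and the object name data is kept only to mirror the code above.

```r
set.seed(1122)
rowNumber <- 1000

count <- rpois(rowNumber, 30)
in.var1 <- rnorm(rowNumber, mean = 25, sd = 3)
in.var2 <- rnorm(rowNumber, mean = 12, sd = 2)
data <- cbind(count, in.var1, in.var2)
dataNew <- data

# One coin flip per row, drawn in a single call
heads <- rbinom(rowNumber, size = 1, prob = 0.5) == 1

# Overwrite count with the pattern on the rows that came up heads
dataNew[heads, "count"] <- 2 * data[heads, "in.var1"] + 10 * data[heads, "in.var2"]
```

Rows where heads is FALSE keep their original Poisson count, so roughly half of the rows follow the pattern exactly.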