Confidence intervals for the mean from 400 random samples

Time: 2017-04-13 16:49:24

Tags: r

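Hello, I have a question about R.

I have data on 200 employees, and I know the mean and standard deviation of the whole population (hours worked).

The following has to be repeated 400 times:

1) Draw a small random sample of 6 people from the population.

2) Build a 90% confidence interval for the mean (μ), assuming the population size is infinite.

3) Of the 400 confidence intervals built in 2), count how many do not contain the mean (μ) of the whole population.

I have drawn the samples, but I can't manage to build the confidence intervals.

This is what I have done so far: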

population <- data$hours01
n <- 6
left <- numeric(400)     # lower bounds of the intervals
right <- numeric(400)    # upper bounds of the intervals
for (i in 1:400) {
    ech <- sample(population, n)
    right[i] <- mean(ech) + 1.645 * sd(ech) / sqrt(n)
    left[i] <- mean(ech) - 1.645 * sd(ech) / sqrt(n)
}

Here is the data:

heur01
    1411
    1734
    1048
    2060
    1983
    1810
    1387
    1637
    1419
    1637
    1185
    1766
    1484
    1983
    1217
    1915
    1846
    1887
    1742
    988
    1375
    1193
    2056
    1919
    1850
    2076
    1463
    1113
    1887
    1919
    1734
    1157
    1766
    1951
    1923
    2173
    1609
    1895
    1109
    1028
    1701
    1875
    1677
    1653
    1883
    1677
    1850
    1738
    1520
    1415
    1992
    1919
    1653
    1625
    1705
    1742
    1891
    2108
    1919
    1911
    1770
    1834
    1911
    2060
    1717
    1943
    1859
    1738
    1222
    1709
    2052
    1141
    1931
    2068
    2044
    1725
    1818
    1798
    1943
    1939
    1919
    1790
    2116
    1750
    2052
    1605
    1798
    2169
    1665
    1673
    1185
    1717
    1717
    1657
    1915
    1778
    2121
    1786
    1774
    2056
    1738
    1883
    1754
    1790
    1770
    1947
    1867
    1794
    1867
    1790
    1762
    2080
    1778
    1903
    1734
    1838
    1560
    1592
    1637
    1467
    1750
    1653
    1222
    1709
    1806
    1334
    1584
    2052
    1802
    1774
    1770
    1258
    1334
    1322
    1826
    1600
    2189
    1907
    1548
    1617
    1693
    1020
    992
    1435
    1613
    1738
    1419
    1121
    1629
    1605
    1455
    1157
    1717
    1294
    1359
    1282
    1758
    1395
    1129
    1189
    1790
    1217
    1133
    1516
    1516
    1278
    1072
    911
    1286
    968
    1076
    1315
    1221
    1268
    939
    1879
    986
    1221
    1456
    1315
    1785
    1080
    1362
    1503
    1127
    1691
    1174
    1644
    1691
    939
    1503
    1080
    1503
    1832
    1362
    1691
    1456
    1879
    1644
    1033

1 answer:

Answer 0 (score: 1)

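You can build a function that calculates a confidence interval, then apply it to samples generated with replicate to produce a matrix of confidence intervals, which you can check against the population mean.

There is a possible complication: when the population standard deviation is unknown, confidence intervals are calculated with the t distribution, but when it is known, the normal distribution is used. With a relatively large number of degrees of freedom this makes little difference, but with only 5 degrees of freedom per sample (n = 6), the difference matters here.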
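To make the size of that difference concrete, here is a small sketch comparing the two critical values for a two-sided 90% interval with n = 6. The variable names are illustrative and not part of the original answer.

```r
n <- 6                       # sample size from the question
conf <- 0.90                 # two-sided confidence level
p <- 1 - (1 - conf) / 2      # upper quantile to look up, i.e. 0.95

z_crit <- qnorm(p)           # normal critical value (sd known)
t_crit <- qt(p, df = n - 1)  # t critical value (sd estimated from the sample)

round(c(normal = z_crit, t = t_crit), 3)
#> normal      t 
#>  1.645  2.015

t_crit / z_crit              # factor by which the t-based multiplier is larger
```

For the same standard deviation, the t-based interval comes out a bit over 20% wider than the normal-based one.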

So, to build a robust function for confidence intervals, you need something like
ci <- function(x, conf.level, sd = NULL){
    conf.level <- mean(c(conf.level, 1))    # two-sided: e.g. 0.9 becomes the 0.95 quantile
    mean.x <- mean(x)
    if (is.null(sd)) {    # when standard deviation unknown,
        sd <- sd(x)    # use sample standard deviation
        z <- qt(conf.level, length(x) - 1)    # and t distribution
    } else {
        z <- qnorm(conf.level)    # when known, use normal
    }
    int <- z * sd / sqrt(length(x))
    c(low = mean.x - int, 
      high = mean.x + int)
}
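As a sanity check on the unknown-sd branch, the interval should agree with what base R's t.test() reports. This is a minimal sketch: the ci function is repeated so the block runs on its own, and x is just the first six values of the data, used as an illustrative sample.

```r
# Same function as above, repeated so this block is self-contained
ci <- function(x, conf.level, sd = NULL){
    conf.level <- mean(c(conf.level, 1))
    mean.x <- mean(x)
    if (is.null(sd)) {
        sd <- sd(x)
        z <- qt(conf.level, length(x) - 1)
    } else {
        z <- qnorm(conf.level)
    }
    int <- z * sd / sqrt(length(x))
    c(low = mean.x - int,
      high = mean.x + int)
}

x <- c(1411, 1734, 1048, 2060, 1983, 1810)    # illustrative sample of 6

manual <- ci(x, 0.9)                              # t-based interval from ci()
builtin <- t.test(x, conf.level = 0.9)$conf.int   # interval from base R

# Compare the bounds, ignoring names and attributes
all.equal(unname(manual), as.numeric(builtin))
#> [1] TRUE
```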

Trying it out:

set.seed(47)    # make sampling reproducible

# make a matrix of confidence intervals
ints <- replicate(400, ci(sample(heur01, 6), .9, sd(heur01)))

ints[, 1:5]
#>          [,1]     [,2]     [,3]     [,4]     [,5]
#> low  1443.959 1441.625 1376.459 1486.625 1436.959
#> high 1865.041 1862.708 1797.541 1907.708 1858.041

# calculate number of intervals that don't contain mean
mean.x <- mean(heur01)
sum(mean.x < ints[1,] | mean.x > ints[2,])
#> [1] 37

Indeed, the results do differ when the standard deviation is not specified:

set.seed(47)
with_sd <- replicate(100, {
    ints <- replicate(400, ci(sample(heur01, 6), .9, sd(heur01)))
    sum(mean.x < ints[1,] | mean.x > ints[2,])
})
summary(with_sd)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    27.0    34.0    37.0    37.5    41.0    50.0

set.seed(47)
no_sd <- replicate(100, {
    ints <- replicate(400, ci(sample(heur01, 6), .9))
    sum(mean.x < ints[1,] | mean.x > ints[2,])
})
summary(no_sd)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   29.00   43.00   46.00   47.07   52.00   66.00

t.test(with_sd, no_sd)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  with_sd and no_sd
#> t = -11.472, df = 187.14, p-value < 2.2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -11.215668  -7.924332
#> sample estimates:
#> mean of x mean of y 
#>     37.50     47.07
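For reference, if every interval had exactly 90% coverage and the 400 intervals were independent, the number of misses would follow a Binomial(400, 0.1) distribution. The sketch below works out that benchmark. It is an idealized assumption, not something the simulation guarantees, and the variable names are illustrative.

```r
n_intervals <- 400          # intervals per run, as in the question
miss_prob <- 1 - 0.90       # nominal chance that one interval misses the mean

# Mean and standard deviation of a Binomial(400, 0.1) miss count
expected_misses <- n_intervals * miss_prob                       # about 40
sd_misses <- sqrt(n_intervals * miss_prob * (1 - miss_prob))     # about 6

c(expected = expected_misses, sd = sd_misses)

# Central 95% range of miss counts under the same assumption
qbinom(c(0.025, 0.975), size = n_intervals, prob = miss_prob)
```

Against this benchmark, the with_sd runs average 37.50 misses, close to 40, while the no_sd runs average 47.07, noticeably above it.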

Data

heur01 <- c(1411L, 1734L, 1048L, 2060L, 1983L, 1810L, 1387L, 1637L, 1419L, 1637L, 1185L, 1766L, 1484L, 1983L, 
    1217L, 1915L, 1846L, 1887L, 1742L, 988L, 1375L, 1193L, 2056L, 1919L, 1850L, 2076L, 1463L, 1113L, 1887L, 
    1919L, 1734L, 1157L, 1766L, 1951L, 1923L, 2173L, 1609L, 1895L, 1109L, 1028L, 1701L, 1875L, 1677L, 1653L, 
    1883L, 1677L, 1850L, 1738L, 1520L, 1415L, 1992L, 1919L, 1653L, 1625L, 1705L, 1742L, 1891L, 2108L, 1919L, 
    1911L, 1770L, 1834L, 1911L, 2060L, 1717L, 1943L, 1859L, 1738L, 1222L, 1709L, 2052L, 1141L, 1931L, 2068L, 
    2044L, 1725L, 1818L, 1798L, 1943L, 1939L, 1919L, 1790L, 2116L, 1750L, 2052L, 1605L, 1798L, 2169L, 1665L, 
    1673L, 1185L, 1717L, 1717L, 1657L, 1915L, 1778L, 2121L, 1786L, 1774L, 2056L, 1738L, 1883L, 1754L, 1790L, 
    1770L, 1947L, 1867L, 1794L, 1867L, 1790L, 1762L, 2080L, 1778L, 1903L, 1734L, 1838L, 1560L, 1592L, 1637L, 
    1467L, 1750L, 1653L, 1222L, 1709L, 1806L, 1334L, 1584L, 2052L, 1802L, 1774L, 1770L, 1258L, 1334L, 1322L, 
    1826L, 1600L, 2189L, 1907L, 1548L, 1617L, 1693L, 1020L, 992L, 1435L, 1613L, 1738L, 1419L, 1121L, 1629L, 
    1605L, 1455L, 1157L, 1717L, 1294L, 1359L, 1282L, 1758L, 1395L, 1129L, 1189L, 1790L, 1217L, 1133L, 1516L, 
    1516L, 1278L, 1072L, 911L, 1286L, 968L, 1076L, 1315L, 1221L, 1268L, 939L, 1879L, 986L, 1221L, 1456L, 
    1315L, 1785L, 1080L, 1362L, 1503L, 1127L, 1691L, 1174L, 1644L, 1691L, 939L, 1503L, 1080L, 1503L, 1832L, 
    1362L, 1691L, 1456L, 1879L, 1644L, 1033L)