Question

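Hello, I have a question about R.

I have a population of 200 employees, and I know the mean and standard deviation of the whole population (working hours).

The following must be repeated 400 times: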

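1) Draw a small random sample of 6 people from the population.

2) Construct a 90% confidence interval for the mean (μ), assuming the population size is infinite.

3) Of the 400 confidence intervals constructed in 2), count how many do not contain the mean (μ) of the whole population.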

I have drawn the samples, but I can't manage to build the confidence intervals.

Here is what I have done so far:

> population<-data$hours01
> n<-6
> Vect <- rep(0,400)
> for(i in 1:400){
+ ech <- sample(population,n)
+ right[i]<-(mean(ech)) + 1.645*(((sd(ech))/sqrt(n)))
+ left[i]<-(mean(ech)) - 1.645*(((sd(ech))/sqrt(n)))

The data are given below.

Answer 1

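You can write a function that calculates a confidence interval and then apply it to samples drawn with replicate. This produces a matrix of confidence intervals that you can check against the population mean.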

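There is a possible complication: when the standard deviation is unknown, confidence intervals are calculated with the t distribution, but when it is known, the normal distribution is used. With a relatively large number of degrees of freedom this makes very little difference, but with only 5 degrees of freedom per sample (n = 6), the difference matters here.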
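To get a feel for the size of that difference, the two critical values for a two-sided 90% interval can be compared directly. This is only a small sketch using base R's qnorm and qt; the variable names (p, z_crit, t_crit) are illustrative and not part of the answer's code.

```r
# Two-sided 90% interval: 5% in each tail, so use the 0.95 quantile
p <- 1 - (1 - 0.90) / 2

z_crit <- qnorm(p)           # normal critical value (sd known)
t_crit <- qt(p, df = 6 - 1)  # t critical value (sd estimated from n = 6)

round(c(z = z_crit, t = t_crit), 3)
#>     z     t 
#> 1.645 2.015
```

With samples of 6, the t-based half-width multiplier is roughly 20% larger than the 1.645 used in the question.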

So to build a robust function for confidence intervals, you need something like

ci <- function(x, conf.level, sd = NULL){
    conf.level <- mean(c(conf.level, 1))    # two-sided level to upper quantile, e.g. .9 -> .95
    mean.x <- mean(x)
    if (is.null(sd)) {    # when standard deviation unknown,
        sd <- sd(x)    # use sample standard deviation
        z <- qt(conf.level, length(x) - 1)    # and t distribution
    } else {
        z <- qnorm(conf.level)    # when known, use normal
    }
    int <- z * sd / sqrt(length(x))
    c(low = mean.x - int, 
      high = mean.x + int)
}
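As a sanity check on the unknown-sd case, the same interval can be computed by hand and compared with the one reported by base R's t.test. This is a sketch, not part of the original answer: the sample x is simply the first six values of the data set, used for illustration, and the names conf, p, half, manual and builtin are made up here.

```r
x <- c(1411, 1734, 1048, 2060, 1983, 1810)  # illustrative sample of 6

conf <- 0.90
p <- mean(c(conf, 1))  # same as 1 - (1 - conf) / 2, i.e. 0.95

# half-width from the t distribution with n - 1 degrees of freedom
half <- qt(p, length(x) - 1) * sd(x) / sqrt(length(x))
manual <- c(low = mean(x) - half, high = mean(x) + half)

# interval reported by the built-in one-sample t-test
builtin <- t.test(x, conf.level = conf)$conf.int

all.equal(unname(manual), as.numeric(builtin))
#> [1] TRUE
```

The agreement also shows why averaging the confidence level with 1 works: it turns a two-sided level into the upper-tail quantile that qt and qnorm expect.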

Trying it out,

set.seed(47)    # make sampling reproducible

# make a matrix of confidence intervals
ints <- replicate(400, ci(sample(heur01, 6), .9, sd(heur01)))

ints[, 1:5]
#>          [,1]     [,2]     [,3]     [,4]     [,5]
#> low  1443.959 1441.625 1376.459 1486.625 1436.959
#> high 1865.041 1862.708 1797.541 1907.708 1858.041

# calculate number of intervals that don't contain mean
mean.x <- mean(heur01)
sum(mean.x < ints[1,] | mean.x > ints[2,])
#> [1] 37
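For context, if the intervals had exactly 90% coverage and were independent, the number of misses out of 400 would follow a binomial distribution. The sketch below shows what that idealised model predicts; it is an assumption for comparison, not a claim about this finite population, and the variable names are illustrative.

```r
n_int <- 400   # number of intervals
alpha <- 0.10  # nominal miss rate for a 90% interval

expected_misses <- n_int * alpha                # 40
sd_misses <- sqrt(n_int * alpha * (1 - alpha))  # 6

# central 95% range of the miss count under the nominal model
miss_range <- qbinom(c(0.025, 0.975), size = n_int, prob = alpha)
```

A count of 37 is within one standard deviation of the nominal 40.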

Indeed, the results do differ when the standard deviation is not supplied:

set.seed(47)
with_sd <- replicate(100, {
    ints <- replicate(400, ci(sample(heur01, 6), .9, sd(heur01)))
    sum(mean.x < ints[1,] | mean.x > ints[2,])
})
summary(with_sd)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    27.0    34.0    37.0    37.5    41.0    50.0

set.seed(47)
no_sd <- replicate(100, {
    ints <- replicate(400, ci(sample(heur01, 6), .9))
    sum(mean.x < ints[1,] | mean.x > ints[2,])
})
summary(no_sd)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   29.00   43.00   46.00   47.07   52.00   66.00

t.test(with_sd, no_sd)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  with_sd and no_sd
#> t = -11.472, df = 187.14, p-value < 2.2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -11.215668  -7.924332
#> sample estimates:
#> mean of x mean of y 
#>     37.50     47.07

Data

heur01 <- c(1411L, 1734L, 1048L, 2060L, 1983L, 1810L, 1387L, 1637L, 1419L, 1637L, 1185L, 1766L, 1484L, 1983L, 
    1217L, 1915L, 1846L, 1887L, 1742L, 988L, 1375L, 1193L, 2056L, 1919L, 1850L, 2076L, 1463L, 1113L, 1887L, 
    1919L, 1734L, 1157L, 1766L, 1951L, 1923L, 2173L, 1609L, 1895L, 1109L, 1028L, 1701L, 1875L, 1677L, 1653L, 
    1883L, 1677L, 1850L, 1738L, 1520L, 1415L, 1992L, 1919L, 1653L, 1625L, 1705L, 1742L, 1891L, 2108L, 1919L, 
    1911L, 1770L, 1834L, 1911L, 2060L, 1717L, 1943L, 1859L, 1738L, 1222L, 1709L, 2052L, 1141L, 1931L, 2068L, 
    2044L, 1725L, 1818L, 1798L, 1943L, 1939L, 1919L, 1790L, 2116L, 1750L, 2052L, 1605L, 1798L, 2169L, 1665L, 
    1673L, 1185L, 1717L, 1717L, 1657L, 1915L, 1778L, 2121L, 1786L, 1774L, 2056L, 1738L, 1883L, 1754L, 1790L, 
    1770L, 1947L, 1867L, 1794L, 1867L, 1790L, 1762L, 2080L, 1778L, 1903L, 1734L, 1838L, 1560L, 1592L, 1637L, 
    1467L, 1750L, 1653L, 1222L, 1709L, 1806L, 1334L, 1584L, 2052L, 1802L, 1774L, 1770L, 1258L, 1334L, 1322L, 
    1826L, 1600L, 2189L, 1907L, 1548L, 1617L, 1693L, 1020L, 992L, 1435L, 1613L, 1738L, 1419L, 1121L, 1629L, 
    1605L, 1455L, 1157L, 1717L, 1294L, 1359L, 1282L, 1758L, 1395L, 1129L, 1189L, 1790L, 1217L, 1133L, 1516L, 
    1516L, 1278L, 1072L, 911L, 1286L, 968L, 1076L, 1315L, 1221L, 1268L, 939L, 1879L, 986L, 1221L, 1456L, 
    1315L, 1785L, 1080L, 1362L, 1503L, 1127L, 1691L, 1174L, 1644L, 1691L, 939L, 1503L, 1080L, 1503L, 1832L, 
    1362L, 1691L, 1456L, 1879L, 1644L, 1033L)

Confidence intervals for the mean of 400 random samples
