如何执行引导并找到数据集中位数的95%置信区间

时间:2018-07-23 14:54:46

标签: r confidence-interval

我正在使用仅包含一列“总计”的数据集“文件”的统计中值执行引导。就是这样:

Total <-
c(2089, 1567, 1336, 1616, 1590, 1649, 1341, 1614, 1590, 1621, 
1621, 1631, 1295, 107, 18, 195, 2059, 870, 2371, 787, 98, 2422, 
655, 1277, 1336, 2109, 1811, 1337, 1290, 1308, 1359, 1600, 1296, 
693, 107, 1359, 89, 89, 89, 89, 2411, 1639, 89, 89, 1283, 89, 
89, 89, 2341, 1012, 1295, 1853, 1277, 1571, 1288, 1300, 1619, 
107, 555, 1612, 1300, 1300, 2093, 133, 1674, 988, 132, 647, 606, 
544, 873, 274, 120, 1620, 1601, 1601, 906, 1603, 1613, 1592, 
1603, 1610, 1321, 2380, 1575, 1575, 1277, 2354, 1561, 1579, 2367, 
2341, 876, 1612, 1588, 2087, 1612, 890, 1586, 1580, 611, 1797, 
2079, 1937, 189, 171, 706, 1647, 1642, 1278, 1650, 1623, 1647, 
1661, 1692, 1632, 1684, 2474, 403, 842, 593, 98, 2354, 1265, 
866, 1483, 2379, 1650, 1875, 1655, 1632, 1691, 1329, 867, 1632, 
1693, 1623, 829, 1659, 1685, 666, 1585, 1659, 2169, 1623, 1645, 
1654, 1698, 2172, 789, 1698, 579, 2443, 335, 132, 1952, 1265, 
978, 1624, 979, 1729, 607, 181, 752, 424, 386, 309, 998, 1435, 
2476, 392, 1657, 348, 1652, 1646, 1345, 2445, 1655, 840, 1624, 
1652, 1321, 1321, 2201, 957, 917, 2458, 4096, 2458, 1346, 2459, 
1634, 2459, 2459, 2459, 2508, 714, 2457, 2457, 1703, 669, 976, 
1634, 2459, 2491, 2393, 625, 1763, 879, 886, 1085, 731, 924, 
1649, 1216, 1647, 2470, 668, 2326, 757, 215, 276, 186, 901, 1402, 
429, 554, 2457, 1643, 986, 730, 1028, 971, 1952, 1584, 1023, 
1352, 839, 2434, 430, 2462, 1327, 1004, 385, 1099, 1067, 758, 
679, 1423, 2495, 1664, 2495, 2495, 1345, 2530, 1754, 1804, 2525, 
1652, 2536, 1646, 2529, 1380, 1845, 963, 1339, 2482, 1417, 1729, 
1384, 1648, 344, 1648, 955, 609, 485, 1822, 513, 223, 222, 193, 
1410, 1159, 586, 585, 2671, 2702, 2529, 2212, 1658, 741, 2529, 
861, 1758, 905, 2529, 597, 1049, 2529, 619, 2620, 2596, 1688, 
2590, 2545, 2590, 883, 287, 723, 2565, 1835, 1738, 2243, 1693, 
2565, 250, 2529, 1880, 1777, 701, 444, 927, 1127, 825, 2726, 
1977, 235, 241, 269, 660, 1523, 420, 678, 213, 544, 940, 983, 
605, 2716, 1848, 1848, 182, 1225, 365, 993, 224, 267, 309, 271, 
324, 178, 2657, 1772, 546, 456, 2637, 1771, 677, 1409, 653, 2359, 
690, 828, 2742, 1812, 2777, 552, 1572, 2742, 2792, 2819, 1753, 
265, 1901, 1753, 2716, 2800, 2742, 453, 2742, 586, 1920, 929, 
1897, 2742, 1859, 1899, 1106, 1135, 759, 730, 1838, 863, 1929, 
2751, 2751, 2751, 2751, 713, 430, 2788, 1784, 966, 2483, 1784, 
1786, 2727, 857, 1798, 1815, 730, 390, 593, 1489, 1448, 1784, 
1510, 2788, 812, 856, 808, 941, 2797, 2757, 1852, 2757, 2412, 
486, 1034, 615, 845, 974, 727, 969, 2916, 1841, 1926, 1926, 533, 
446, 733, 696, 1214, 1857, 1907, 2824, 2631, 3556, 2496, 1617, 
1000, 707, 936, 761, 960, 1936, 857, 423, 1130, 1165, 2453, 338, 
988, 1869, 1951, 1932, 2820, 2742, 628, 447, 866, 637, 932, 2742, 
1795, 2881, 695, 762, 2778, 427, 714, 2781, 1865, 1861, 678, 
1465, 1770, 845, 356, 817, 385, 1820, 2692, 1787, 1510, 1814, 
857, 2616, 204, 465, 1773, 2754, 1793, 1773, 1900, 185, 2706, 
1162, 766, 2742, 1816, 2742, 1790, 1803, 1795, 1026, 334, 832, 
478, 1849, 2679, 1773, 797, 2649, 1814, 1808, 99, 2037, 2616, 
2719, 1813, 2637, 2648, 1813, 865, 1717, 2588, 2711, 2818, 1828, 
2553, 2720, 1791, 1780, 2706, 2565, 1717, 1881, 1037, 329, 893, 
723, 1821, 2692, 2586, 2729, 1755, 1793, 2670, 2602, 2638, 2684, 
1813, 1755, 1755, 2626, 832, 739, 724, 1968, 2598, 2627, 851, 
749, 684, 625, 2673, 2778, 1764, 2644, 1800, 1792, 511, 2776, 
1890, 1764, 2776, 1040, 1049, 2699, 2061, 897, 1764, 274, 2755, 
1912, 2581, 1780, 820, 1803, 2692, 2783, 572, 2751, 2699, 1830, 
1875, 633, 1083)

然后我尝试使用bootstrap函数:

> boot (Total, median, 1000)        

普通非参数引导

致电: 引导(数据=总数,统计=中位数,R = 1000)

引导程序统计信息:     原始偏差标准错误 t1 * 1603 0 0 有50个或更多警告(请使用warnings()查看前50个警告)

警告消息是: 条件的长度> 1,并且只会使用第一个元素

您能告诉我如何执行自举以生成中值的95%置信区间吗?我是一个初学者,非常感谢您的帮助。

非常感谢您。

4 个答案:

答案 0 :(得分:2)

启动程序包中的boot函数在功能上似乎有点不直观。但是,如果您阅读文档(或查看文档中的示例),则会看到有关statistic参数的具体说明:

  

在所有其他情况下,统计信息必须至少包含两个参数。的   传递的第一个参数将始终是原始数据。第二   将是索引,频率或权重的向量,这些向量定义了   引导程序样本。

所以代替:

x <- rnorm(10)
boot(data = x,statistic = median,R = 1000)

您想要这个:

boot(data = x,statistic = function(x,i) median(x[i]),R = 1000)

到此为止,函数boot.ci()可用于计算置信区间(我相信在此特定示例中只有其中一些可用)。

b <- boot(data = x,statistic = function(x,i) median(x[i]),R = 1000)
boot.ci(b)

答案 1 :(得分:0)

尽管@joran的答案是正确的,但由于我已经使用CI计算对代码进行了测试,所以就可以了。

document.createElement()

答案 2 :(得分:0)

这是您“制作自己的” bootrap的方法:

# number of bootstrap replicates
B <- 10000

# create empty storage container
result_vec <- vector(length=B)

for(b in 1:B) {
    # draw a bootstrap sample
    this_sample <- sample(Total, size=length(Total), replace=TRUE)

    # calculate your statistic
    m <- median(this_sample)

    # save your calucated statistic
    result_vec[b] <- m
}

# then probably draw a histogram of your bootstrapped replicates
hist(result_vec)

# get 95% confidence interval
result_vec <- result_vec[order(result_vec)]
lower_bound <- result_vec[round(0.025*B)]
upper_bound <- result_vec[round(0.0975*B)]

答案 3 :(得分:0)

我在此代码中使用标准的普通随机生成器:

B <- i
bs.result <- matrix(NA, nrow=i, ncol=...)
for (b in 1:i) {
sample.n <- rnorm(n, mean-..., sd=...)
optim.b <- optim(c(mu=0, sd=1), loglik, control=list(fnscale=-1), z=sample.n)
bs.result <- c(optim.b$par, optim.b$converge)
}

在表的最后一列,您可以检查优化函数是否收敛。