Question

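First, I must admit that my statistics knowledge is rusty at best: even when it was fresh, it was not a subject I particularly liked, which means I had a hard time making sense of it.

Nevertheless, I took a look at how the barplot graphs calculate their error bars, and was surprised to find a "confidence interval" (CI) used instead of the (more common) standard deviation. Researching CIs further led me to this Wikipedia article, which seems to say that, basically, a CI is computed as: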

mean minus 1.96 times stdev over sqrt(n)

mean plus 1.96 times stdev over sqrt(n)

Or, in pseudocode:

import numpy as np

def ci_wp(a):
    """calculate confidence interval using Wikipedia's formula"""
    m = np.mean(a)
    s = 1.96*np.std(a)/np.sqrt(len(a))
    return m - s, m + s

But what we find in seaborn/utils.py is:

def ci(a, which=95, axis=None):
    """Return a percentile range from an array of values."""
    p = 50 - which / 2, 50 + which / 2
    return percentiles(a, p, axis)

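Now maybe I am missing something completely, but this looks like an entirely different calculation from the one proposed by Wikipedia. Can anyone explain this discrepancy?

To give another example, from the comments, why do we get such different results between: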

 >>> sb.utils.ci(np.arange(100))
 array([ 2.475, 96.525])

 >>> ci_wp(np.arange(100))
 [43.842250270646467,55.157749729353533]

Compared with other statistical tools:

 import numpy as np
 import scipy as sp
 import scipy.stats  # makes sp.stats.sem available

 def ci_std(a):
     """calculate margin of error using standard deviation"""
     m = np.mean(a)
     s = np.std(a)
     return m-s, m+s

 def ci_sem(a):
     """calculate margin of error using standard error of the mean"""
     m = np.mean(a)
     s = sp.stats.sem(a)
     return m-s, m+s

Which gives us:

>>> ci_sem(np.arange(100))
(46.598850802411796, 52.401149197588204)

>>> ci_std(np.arange(100))
(20.633929952277882, 78.366070047722118)

Or with a random sample:

rng = np.random.RandomState(10)
a = rng.normal(size=100)
print(sb.utils.ci(a))
print(ci_wp(a))
print(ci_sem(a))
print(ci_std(a))

...which yields:

[-1.9667006   2.19502303]
(-0.1101230745774124, 0.26895640045116026)
(-0.017774461397903049, 0.17660778727165088)
(-0.88762281417683186, 1.0464561400505796)

Why are Seaborn's numbers so radically different from the other results?

Answer 1

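Your calculation using the Wikipedia formula is completely correct. Seaborn just uses another method: bootstrapping (https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). Dragicevic [1] describes it well:

[It] consists of generating many alternative datasets from the experimental data by randomly drawing observations with replacement. The variability across these datasets is assumed to approximate sampling error and is used to compute so-called bootstrap confidence intervals. [...] It is very versatile and works for many kinds of distributions.

In Seaborn's source code, barplot uses estimate_statistic, which bootstraps the data and then computes the confidence interval on the result: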

>>> sb.utils.ci(sb.algorithms.bootstrap(np.arange(100)))
array([43.91, 55.21025])

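The result is consistent with your calculation. Applying ci to the raw array, as in your example, gives the percentile range of the data itself rather than of the bootstrapped statistic.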
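To make the bootstrap idea concrete, here is a minimal NumPy-only sketch of a percentile bootstrap for the mean, next to the normal-approximation formula from the question. It is not Seaborn's implementation: the names bootstrap_ci and normal_ci, the n_boot and seed parameters, and the choice of 10000 resamples are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_ci(a, n_boot=10000, which=95, func=np.mean, seed=0):
    """Percentile bootstrap CI for func(a); illustrative sketch, not seaborn's code."""
    a = np.asarray(a)
    rng = np.random.default_rng(seed)
    # Draw n_boot resamples of the same size as a, with replacement.
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))
    # Evaluate the statistic on every resample (one value per row).
    boot_stats = func(a[idx], axis=1)
    # Keep the central `which` percent of the bootstrap distribution.
    lo, hi = 50 - which / 2, 50 + which / 2
    return np.percentile(boot_stats, [lo, hi])

def normal_ci(a):
    """mean +/- 1.96 * stdev / sqrt(n), the same formula as ci_wp in the question."""
    a = np.asarray(a)
    m = np.mean(a)
    s = 1.96 * np.std(a) / np.sqrt(len(a))
    return m - s, m + s

data = np.arange(100)
boot_lo, boot_hi = bootstrap_ci(data)
norm_lo, norm_hi = normal_ci(data)
print(boot_lo, boot_hi)
print(norm_lo, norm_hi)
```

The two intervals should land close to each other, because the percentiles are taken over the resampled means and not over the raw values. The exact bootstrap numbers depend on the seed and on the number of resamples.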

[1] Dragicevic, P. (2016). Fair statistical communication in HCI. In Modern Statistical Methods for HCI (pp. 291-330). Springer, Cham.

Answer 2

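You need to look at the code for percentiles. The seaborn ci code you posted simply computes percentile limits. The interval is centered on the 50th percentile (the median) and by default covers a 95% range. The actual mean, standard deviation, etc. would appear in the percentiles routine.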
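As a rough illustration of what those percentile limits are, the sketch below reproduces the arithmetic of ci with which=95. It uses np.percentile as an assumed stand-in for seaborn's percentiles helper, and percentile_interval is a hypothetical name, not a seaborn function.

```python
import numpy as np

def percentile_interval(a, which=95):
    """Central `which` percent range of a; stand-in for seaborn's ci, not its code."""
    # For which=95 this is the pair (2.5, 97.5).
    p = 50 - which / 2, 50 + which / 2
    return np.percentile(a, p)

raw = np.arange(100)
interval = percentile_interval(raw)
print(interval)  # [ 2.475 96.525]
```

On np.arange(100) this matches the sb.utils.ci output quoted in the question, which suggests that the function by itself describes the spread of whatever array it is given.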

Is the seaborn confidence interval computed correctly?
