Question

I am trying to run a KS test on two curves: one is the raw data (red) and the other is a power-law fit.

Here Red.Y holds the y value at each x point, and Blue.Y holds the power-law y value at each x.

from scipy import stats
stats.ks_2samp(Red.Y, Blue.Y)

Out[210]:
Ks_2sampResult(statistic=0.16666666666666669, pvalue=0.99133252540492101)

The p-value seems very large, given that the two curves are clearly not identical. Why is that?

The values of Red.Y are:

The values of Blue.Y are:

Answer 1

Basically, a KS test compares the cumulative distribution functions (CDFs) of two data samples (see Wikipedia). Suppose you have the blue-line data and the red-line data:

    import numpy as np

    red_line = [0.000018, 0.000019, 0.000016, 0.000018, 0.000018, 0.000022,
                0.000021, 0.000022, 0.000025, 0.000025, 0.000024, 0.000026]
    blue_line = [0.000017, 0.000017, 0.000018, 0.000019, 0.000020, 0.000021,
                 0.000021, 0.000022, 0.000023, 0.000024, 0.000025, 0.000026]

    # np.searchsorted needs sorted input, so sort each sample first
    red_sorted = np.sort(red_line)
    blue_sorted = np.sort(blue_line)
    data_all = np.concatenate([red_sorted, blue_sorted])
    n1 = len(red_line)
    n2 = len(blue_line)

    # CDF of red line, evaluated at every observed value
    cdf1 = np.searchsorted(red_sorted, data_all, side='right') / (1.0 * n1)
    # CDF of blue line, evaluated at every observed value
    cdf2 = np.searchsorted(blue_sorted, data_all, side='right') / (1.0 * n2)
    # D-statistic
    d = np.max(np.absolute(cdf1 - cdf2))

The D statistic (the first value returned) is the maximum distance between the two CDFs.

The p-value is obtained by scaling D by a factor that depends on the sample sizes and evaluating the survival function of the Kolmogorov distribution (the distribution of the supremum of a Brownian bridge) at that value. You can see how it is calculated in the source code. Basically, if the difference between the CDFs is small relative to that distribution, you get e.g. p > 0.1 (meaning you cannot reject the hypothesis that the two samples come from the same distribution).

    from scipy.stats import distributions

    en = np.sqrt(n1 * n2 / float(n1 + n2))
    prob = distributions.kstwobign.sf((en + 0.12 + 0.11 / en) * d)  # p-value

With the data given here, the p-value comes out well above 0.1.
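The two steps above can be wrapped into one helper and checked against scipy. This is a minimal sketch, not scipy's own implementation: `ks_two_sample` is a hypothetical name, the normal samples are synthetic illustration data with an arbitrary seed, and the p-value uses the asymptotic formula quoted in this answer, so it may differ from what `stats.ks_2samp` reports when scipy picks an exact method for small samples. Only the D statistic is expected to agree.

```python
import numpy as np
from scipy.stats import distributions, ks_2samp


def ks_two_sample(sample1, sample2):
    """Return (D, asymptotic p-value) for a two-sample KS test (sketch)."""
    s1 = np.sort(np.asarray(sample1, dtype=float))
    s2 = np.sort(np.asarray(sample2, dtype=float))
    n1, n2 = len(s1), len(s2)
    # Evaluate both empirical CDFs at every observed value
    data_all = np.concatenate([s1, s2])
    cdf1 = np.searchsorted(s1, data_all, side="right") / n1
    cdf2 = np.searchsorted(s2, data_all, side="right") / n2
    # D is the largest vertical gap between the two empirical CDFs
    d = np.max(np.abs(cdf1 - cdf2))
    # Asymptotic p-value from the Kolmogorov distribution
    en = np.sqrt(n1 * n2 / (n1 + n2))
    p = distributions.kstwobign.sf((en + 0.12 + 0.11 / en) * d)
    return d, p


# Synthetic samples, for illustration only
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(loc=0.1, size=300)

d, p = ks_two_sample(a, b)
reference = ks_2samp(a, b)
print("manual D:", d, "scipy D:", reference.statistic)
print("asymptotic p:", p, "scipy p:", reference.pvalue)
```

Sorting inside the helper matters: `np.searchsorted` silently returns wrong positions on unsorted input, which would corrupt both CDFs.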

So yes, even though the plots look different, the CDFs of the two samples can be very similar, and that is why the p-value is still large.
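One way to see why the plot and the test can disagree: the KS test treats each series as an unordered bag of y values and never looks at which x a value belongs to. The sketch below, which reuses the rounded `red_line` and `blue_line` values from this answer and an arbitrary seed, shuffles the red values and shows that `stats.ks_2samp` returns the same result, even though the shuffled curve would look nothing like the original when plotted against x.

```python
import numpy as np
from scipy import stats

red_line = [0.000018, 0.000019, 0.000016, 0.000018, 0.000018, 0.000022,
            0.000021, 0.000022, 0.000025, 0.000025, 0.000024, 0.000026]
blue_line = [0.000017, 0.000017, 0.000018, 0.000019, 0.000020, 0.000021,
             0.000021, 0.000022, 0.000023, 0.000024, 0.000025, 0.000026]

# Reorder the red values; the seed is arbitrary
rng = np.random.default_rng(0)
shuffled_red = rng.permutation(red_line)

original = stats.ks_2samp(red_line, blue_line)
shuffled = stats.ks_2samp(shuffled_red, blue_line)

# Same set of values, so the empirical CDF and the test result are unchanged
print("original:", original.statistic, original.pvalue)
print("shuffled:", shuffled.statistic, shuffled.pvalue)
```

If the goal is to judge how well the power law fits point by point, a test on the value distributions alone is therefore a weak check.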

Python KS test - why is the p-value so large
