Question

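I'm calculating the Gini coefficient (similar to: Python - Gini coefficient calculation using Numpy), but I get an odd result. For a uniform distribution sampled from np.random.rand(), the Gini coefficient is 0.3, whereas I expected it to be close to 0 (perfect equality). What is going wrong here?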

import numpy as np
import matplotlib.pyplot as plt

def G(v):
    bins = np.linspace(0., 100., 11)
    total = float(np.sum(v))
    yvals = []
    for b in bins:
        bin_vals = v[v <= np.percentile(v, b)]
        bin_fraction = (np.sum(bin_vals) / total) * 100.0
        yvals.append(bin_fraction)
    # perfect equality area
    pe_area = np.trapz(bins, x=bins)
    # lorenz area
    lorenz_area = np.trapz(yvals, x=bins)
    gini_val = (pe_area - lorenz_area) / float(pe_area)
    return bins, yvals, gini_val

v = np.random.rand(500)
bins, result, gini_val = G(v)
plt.figure()
plt.subplot(2, 1, 1)
plt.plot(bins, result, label="observed")
plt.plot(bins, bins, '--', label="perfect eq.")
plt.xlabel("fraction of population")
plt.ylabel("fraction of wealth")
plt.title("GINI: %.4f" %(gini_val))
plt.legend()
plt.subplot(2, 1, 2)
plt.hist(v, bins=20)
plt.show()

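For a given set of numbers, the code above computes the fraction of the distribution's total value that falls at or below each percentile bin.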

Result:

A uniform distribution should be close to "perfect equality", so the bend in the Lorenz curve looks wrong.

Answer 1

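This is to be expected. A random sample from a uniform distribution does not result in uniform values (i.e. values that are all relatively close to each other). With a little calculus, it can be shown that the expected value (in the statistical sense) of the Gini coefficient of a sample from the uniform distribution on [0, 1] is 1/3, so getting values around 1/3 for a given sample is reasonable.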

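You'll get a lower Gini coefficient with a sample such as v = 10 + np.random.rand(500). Those values are all close to 10.5; the relative variation is lower than in the sample v = np.random.rand(500). In fact, the expected value of the Gini coefficient for the sample base + np.random.rand(n) is 1/(6*base + 3).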
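
For anyone who wants the "little calculus" spelled out, here is a sketch of the derivation. It assumes X and Y are independent draws from base + U(0, 1) and uses the definition of the Gini coefficient as half the relative mean absolute difference. Strictly speaking this gives the population value, which the sample value should approach as n grows.

```latex
\begin{align*}
\mathbb{E}\,|X - Y| &= \int_0^1\!\!\int_0^1 |x - y|\,dy\,dx
  = 2\int_0^1\!\!\int_0^x (x - y)\,dy\,dx
  = 2\int_0^1 \frac{x^2}{2}\,dx = \frac{1}{3},\\
\mu &= \mathrm{base} + \tfrac{1}{2},\\
G &= \frac{\mathbb{E}\,|X - Y|}{2\mu}
  = \frac{1/3}{2\,\mathrm{base} + 1}
  = \frac{1}{6\,\mathrm{base} + 3}.
\end{align*}
```

The shift by base cancels in X - Y, so only the mean in the denominator changes. Setting base = 0 gives the 1/3 quoted above.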

Here's a simple implementation of the Gini coefficient. It uses the fact that the Gini coefficient is half the relative mean absolute difference.

def gini(x):
    # (Warning: This is a concise implementation, but it is O(n**2)
    # in time and memory, where n = len(x).  *Don't* pass in huge
    # samples!)

    # Mean absolute difference
    mad = np.abs(np.subtract.outer(x, x)).mean()
    # Relative mean absolute difference
    rmad = mad/np.mean(x)
    # Gini coefficient
    g = 0.5 * rmad
    return g

Here are the Gini coefficients for several samples of the form v = base + np.random.rand(500):

In [80]: v = np.random.rand(500)

In [81]: gini(v)
Out[81]: 0.32760618249832563

In [82]: v = 1 + np.random.rand(500)

In [83]: gini(v)
Out[83]: 0.11121487509454202

In [84]: v = 10 + np.random.rand(500)

In [85]: gini(v)
Out[85]: 0.01567937753659053

In [86]: v = 100 + np.random.rand(500)

In [87]: gini(v)
Out[87]: 0.0016594595244509495
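
To compare such sample values with 1/(6*base + 3) directly, something like the following sketch could be used. The seed, the `results` list and the use of np.random.default_rng are my own choices for reproducibility, not part of the session above, so the exact numbers will differ from the ones shown there.

```python
import numpy as np

def gini(x):
    """Gini coefficient as half the relative mean absolute difference.

    O(n**2) in time and memory, so only suitable for small samples.
    """
    x = np.asarray(x, dtype=float)
    mad = np.abs(np.subtract.outer(x, x)).mean()
    return 0.5 * mad / x.mean()

rng = np.random.default_rng(0)  # arbitrary seed, assumed here for reproducibility
results = []
for base in (0, 1, 10, 100):
    v = base + rng.random(500)
    observed = gini(v)
    predicted = 1.0 / (6 * base + 3)
    results.append((base, observed, predicted))
    print(f"base={base:>3}  sample Gini={observed:.5f}  1/(6*base+3)={predicted:.5f}")
```

With 500 points the sample value should land within a few percent of the prediction, though not exactly on it.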

Answer 2

A slightly faster implementation (it uses numpy vectorization and computes each difference only once):

def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    diffsum = 0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x)**2 * np.mean(x))

Note: x must be a numpy array.
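
If the Python loop is still too slow for large inputs, one possible alternative is to sort first, since for sorted values the sum of pairwise differences collapses to a weighted sum over ranks. This is only a sketch, not something from the answers here: the name `gini_sorted` is made up, and it assumes non-negative values with a positive sum.

```python
import numpy as np

def gini_sorted(x):
    """Gini coefficient in O(n log n) via a sort (hypothetical helper).

    For ascending x_(1) <= ... <= x_(n):
        sum_{i<j} (x_(j) - x_(i)) = sum_i (2*i - n - 1) * x_(i)
    so G = sum_i (2*i - n - 1) * x_(i) / (n * sum(x)).
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)  # 1-based ranks of the sorted values
    return np.sum((2 * ranks - n - 1) * x) / (n * np.sum(x))

# Cross-check against the O(n**2) pairwise definition on a small sample.
sample = np.random.default_rng(1).random(200)
pairwise = np.abs(np.subtract.outer(sample, sample)).mean() / (2 * sample.mean())
print(gini_sorted(sample), pairwise)
```

The two printed numbers should agree up to floating-point rounding.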

Answer 3

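A quick note on the original approach:

When calculating the Gini coefficient directly from the area under the curve with np.trapz or another integration method, the first value of the Lorenz curve must be 0 so that the area between the origin and the second value is accounted for. The following changes to G(v) fix this: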

yvals = [0]
for b in bins[1:]:

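I also discuss this issue in this answer, where including the origin in those calculations gives a result equivalent to the other methods discussed here (which do not need a 0 prepended).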

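In short, when calculating the Gini coefficient directly by integration, start from the origin. If you use the other methods discussed here, that is not needed.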
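
As an illustration of the integration route with the origin included, here is a minimal sketch. It is not the original G(v): it uses one Lorenz point per sorted value instead of percentile bins, the name `gini_lorenz` is made up, and the trapezoid rule is written out by hand so it does not depend on which of np.trapz / np.trapezoid your NumPy version provides.

```python
import numpy as np

def gini_lorenz(x):
    """Gini coefficient from the area under the Lorenz curve (hypothetical helper).

    Assumes non-negative values with a positive sum.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # Lorenz curve: cumulative share of the total, with the origin prepended.
    lorenz = np.concatenate(([0.0], np.cumsum(x) / x.sum()))
    population = np.linspace(0.0, 1.0, n + 1)
    # Trapezoid rule over the (population, lorenz) points.
    area = np.sum((lorenz[1:] + lorenz[:-1]) / 2.0 * np.diff(population))
    # The area under the perfect-equality line is 1/2, so G = (1/2 - area) / (1/2).
    return 1.0 - 2.0 * area

print(gini_lorenz([0.0, 0.0, 0.0, 1.0]))
```

With the origin included, this should match the mean-absolute-difference definition on the same data; dropping the leading 0 is what makes the integrated value drift.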

Answer 4

Note that a Gini index is now available in skbio.diversity.alpha as gini_index. It may give slightly different results from the examples mentioned above.

Answer 5

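The Gini coefficient is derived from the area under the Lorenz curve (it is the area between the perfect-equality line and the Lorenz curve, divided by the area under the perfect-equality line) and is commonly used to analyze the distribution of income in a population. https://github.com/oliviaguest/gini provides a simple implementation of the same in Python.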

Calculating the Gini coefficient in Python / numpy
