Question

I have a data frame that looks like this:

Out[14]:
    impwealth  indweight
16     180000     34.200
21     384000     37.800
26     342000     39.715
30    1154000     44.375
31     421300     44.375
32    1210000     45.295
33    1062500     45.295
34    1878000     46.653
35     876000     46.653
36     925000     53.476

I want to calculate the weighted median of the column impwealth using the frequency weights in indweight. My pseudocode looks like this:

# Sort `impwealth` in ascending order 
df.sort('impwealth', 'inplace'=True)

# Find the 50th percentile weight, P
P = df['indweight'].sum() * (.5)

# Search for the first occurrence of `impweight` that is greater than P 
i = df.loc[df['indweight'] > P, 'indweight'].last_valid_index()

# The value of `impwealth` associated with this index will be the weighted median
w_median = df.ix[i, 'impwealth']

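This method seems clunky, and I'm not sure it's correct. I didn't find a built-in way to do this in the pandas reference. What is the best way to find the weighted median?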

Answer 1

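If you want to do this in pure pandas, here's one way. It does not interpolate either. (@svenkatesh, you were missing the cumulative sum in your pseudocode.)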

df.sort_values('impwealth', inplace=True)
cumsum = df.indweight.cumsum()
cutoff = df.indweight.sum() / 2.0
median = df.impwealth[cumsum >= cutoff].iloc[0]

This gives a median of 925000.
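For a self-contained check, the snippet below rebuilds the data frame from the question and applies the same steps without sorting `df` in place. The names `df_sorted` and `median` are only illustrative.

```python
import pandas as pd

# Data copied from the question; index labels match the original frame.
df = pd.DataFrame(
    {
        "impwealth": [180000, 384000, 342000, 1154000, 421300,
                      1210000, 1062500, 1878000, 876000, 925000],
        "indweight": [34.200, 37.800, 39.715, 44.375, 44.375,
                      45.295, 45.295, 46.653, 46.653, 53.476],
    },
    index=[16, 21, 26, 30, 31, 32, 33, 34, 35, 36],
)

# Sort by value, then accumulate the weights in that order.
df_sorted = df.sort_values("impwealth")
cumsum = df_sorted["indweight"].cumsum()
cutoff = df_sorted["indweight"].sum() / 2.0

# First value whose cumulative weight reaches half of the total weight.
median = df_sorted.loc[cumsum >= cutoff, "impwealth"].iloc[0]
print(median)  # 925000
```

The total weight is 437.837, so the cutoff is 218.9185; the cumulative weight first reaches it at the row with impwealth 925000.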

Answer 2

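Have you tried the wquantiles package? I had never used it before, but it has a weighted median function that seems to give at least a reasonable answer (you'll probably want to double-check that it uses the approach you expect).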

In [12]: import weighted

In [13]: weighted.median(df['impwealth'], df['indweight'])
Out[13]: 914662.0859091772
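The result differs from a plain cumulative-weight cutoff because it is interpolated. As a rough illustration, the NumPy-only sketch below applies a midpoint-interpolation scheme to the question's data and lands on about the same number. That wquantiles implements exactly this scheme is an assumption here, not something checked against its source.

```python
import numpy as np

# Data copied from the question.
values = np.array([180000, 384000, 342000, 1154000, 421300,
                   1210000, 1062500, 1878000, 876000, 925000], dtype=float)
weights = np.array([34.200, 37.800, 39.715, 44.375, 44.375,
                    45.295, 45.295, 46.653, 46.653, 53.476])

# Sort values and carry the weights along.
order = np.argsort(values)
v, w = values[order], weights[order]

# Place each value at the midpoint of its own weight interval,
# expressed as a fraction of the total weight.
positions = (np.cumsum(w) - 0.5 * w) / w.sum()

# Linearly interpolate between neighbouring values at position 0.5.
interpolated_median = np.interp(0.5, positions, v)
print(interpolated_median)  # roughly 914662
```

The 0.5 position falls between 876000 and 925000, which is why the interpolated value sits between those two rows instead of on one of them.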

Answer 3

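You can also use this function, which I wrote for the same purpose.

Note: weighted uses interpolation at the end to select the 0.5 quantile (you can look at the code yourself).

My function simply returns the value at which the cumulative weight reaches 0.5.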

import numpy as np

def weighted_median(values, weights):
    '''Compute the weighted median of a list of values.

    The weighted median is computed as follows:
    1- sort both lists (values and weights) based on values.
    2- select the 0.5 point from the weights and return the corresponding value as the result.
    e.g. values = [1, 3, 0] and weights = [0.1, 0.3, 0.6], assuming weights are probabilities.
    sorted values = [0, 1, 3] and corresponding sorted weights = [0.6, 0.1, 0.3]; the 0.5 point on
    the weights corresponds to the first item, which is 0, so the weighted median is 0.
    '''

    #convert the weights into probabilities
    sum_weights = sum(weights)
    weights = np.array([(w*1.0)/sum_weights for w in weights])
    #sort values and weights based on values
    values = np.array(values)
    sorted_indices = np.argsort(values)
    values_sorted  = values[sorted_indices]
    weights_sorted = weights[sorted_indices]
    #select the median point
    it = np.nditer(weights_sorted, flags=['f_index'])
    accumulative_probability = 0
    median_index = -1
    while not it.finished:
        accumulative_probability += it[0]
        if accumulative_probability > 0.5:
            median_index = it.index
            return values_sorted[median_index]
        elif accumulative_probability == 0.5:
            median_index = it.index
            it.iternext()
            next_median_index = it.index
            return np.mean(values_sorted[[median_index, next_median_index]])
        it.iternext()

    return values_sorted[median_index]

# compare the weighted_median function with np.median on the expanded data
print(weighted_median([1, 3, 0, 7], [2, 3, 3, 9]))
print(np.median([1, 1, 0, 0, 0, 3, 3, 3, 7, 7, 7, 7, 7, 7, 7, 7, 7]))
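With integer frequency weights, one way to sanity-check a weighted median is to expand the data with `np.repeat` and compare against `np.median`. The sketch below does that with a vectorized variant of the same cumulative-weight idea; `weighted_median_cumsum` is a made-up helper name, not the function above, and the exact-equality tie test is only reliable when the weights are integers.

```python
import numpy as np

def weighted_median_cumsum(values, weights):
    """Hypothetical helper: cumulative-weight median, averaging on exact ties."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    half = cum[-1] / 2.0
    # First index whose cumulative weight is >= half of the total.
    i = np.searchsorted(cum, half)
    # Exactly half the weight lies at or below v[i]: average with the next value.
    if cum[i] == half and i + 1 < len(v):
        return (v[i] + v[i + 1]) / 2.0
    return v[i]

# Compare against np.median on the expanded samples.
cases = [([1, 3, 0, 7], [2, 3, 3, 9]),
         ([1, 2, 3, 4], [2, 1, 1, 2])]
for vals, freqs in cases:
    expanded = np.repeat(vals, freqs)
    print(weighted_median_cumsum(vals, freqs), np.median(expanded))
```

The second case has an even total weight split exactly in half, so both approaches average the two middle values.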

Answer 4

This function generalizes proofreader's solution:

def weighted_median(df, val, weight):
    df_sorted = df.sort_values(val)
    cumsum = df_sorted[weight].cumsum()
    cutoff = df_sorted[weight].sum() / 2.
    # index the sorted frame: the mask follows the sorted row order
    return df_sorted[cumsum >= cutoff][val].iloc[0]

In this example, the call is weighted_median(df, 'impwealth', 'indweight').
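A quick way to check a function like this is to confirm that the result does not depend on the row order of the input. The sketch below is a standalone variant that reads the result from the sorted frame; the shuffled copy and the `random_state` value are arbitrary choices made for illustration.

```python
import pandas as pd

def weighted_median(df, val, weight):
    # Sketch: take the mask and the rows from the same sorted frame.
    df_sorted = df.sort_values(val)
    cumsum = df_sorted[weight].cumsum()
    cutoff = df_sorted[weight].sum() / 2.0
    return df_sorted.loc[cumsum >= cutoff, val].iloc[0]

# Data copied from the question.
df = pd.DataFrame(
    {
        "impwealth": [180000, 384000, 342000, 1154000, 421300,
                      1210000, 1062500, 1878000, 876000, 925000],
        "indweight": [34.200, 37.800, 39.715, 44.375, 44.375,
                      45.295, 45.295, 46.653, 46.653, 53.476],
    },
    index=[16, 21, 26, 30, 31, 32, 33, 34, 35, 36],
)

# Shuffle the rows; the weighted median should be unaffected.
shuffled = df.sample(frac=1, random_state=0)

result_original = weighted_median(df, "impwealth", "indweight")
result_shuffled = weighted_median(shuffled, "impwealth", "indweight")
print(result_original, result_shuffled)  # 925000 925000
```

Because `sort_values` returns a new frame, the caller's `df` keeps its original row order.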

Answer 5

You can use this solution from "Weighted percentile using numpy":

import numpy as np

def weighted_quantile(values, quantiles, sample_weight=None,
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `values`
    :param values_sorted: bool, if True, then will avoid sorting of
        initial array
    :param old_style: if True, will correct output to be consistent
        with numpy.percentile.
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), \
        'quantiles should be in [0, 1]'

    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]

    weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        # To be consistent with numpy.percentile
        weighted_quantiles -= weighted_quantiles[0]
        weighted_quantiles /= weighted_quantiles[-1]
    else:
        weighted_quantiles /= np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)

Call it as weighted_quantile(df.impwealth, quantiles=0.5, sample_weight=df.indweight).

Answer 6

You can also use the robustats library to compute the weighted median:

import numpy as np
import robustats # pip install robustats


# Weighted Median
x = np.array([1.1, 5.3, 3.7, 2.1, 7.0, 9.9])
weights = np.array([1.1, 0.4, 2.1, 3.5, 1.2, 0.8])

weighted_median = robustats.weighted_median(x, weights)

print("The weighted median is {}".format(weighted_median))

Answer 7

There is a weightedstats package, available through conda and pip, that provides weighted_median.

Assuming you are using conda in a terminal (Mac/Linux) or the Anaconda Prompt (Windows):

conda activate YOURENVIRONMENT
conda install -c conda-forge -y weightedstats

(-y means "don't ask me to confirm the changes, just do it".)

Then, in your Python code:

import pandas as pd
import weightedstats as ws

df = pd.read_csv('/your/data/file.csv')
ws.weighted_median(df['values_col'], df['weights_col'])

I'm not sure it works in all cases, but I compared it on some simple data against the function weightedMedian() from the R package matrixStats, and I got the same result from both.

P.S.: By the way, you can also compute weighted_mean() with weightedstats, although NumPy can do that as well:

np.average(df['values_col'], weights=df['weights_col'])
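As a small illustration of what `np.average` computes when given weights, the sketch below compares it with the explicit formula sum(values * weights) / sum(weights). The sample numbers are made up for the example.

```python
import numpy as np

# Made-up sample data for illustration.
values = np.array([1.0, 3.0, 0.0, 7.0])
weights = np.array([2.0, 3.0, 3.0, 9.0])

# Weighted mean written out by hand.
manual = (values * weights).sum() / weights.sum()

# The same quantity via NumPy.
builtin = np.average(values, weights=weights)

print(manual, builtin)  # both equal 74 / 17
```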

Python: weighted median algorithm with pandas
