Question

I have a list of values describing some population. The values are distributed in two peaks, as shown in the histogram below. [Figure: histogram with two distinct peaks]

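Is there a simple way to automatically detect the "gap" in the middle of the distribution and split the initial list into its two sides? Preferably using numpy, if possible.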

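~~Edited to add: obviously I could sort the list, iterate over it and split it at the first zero value, but~~ I am hoping for a more robust method, one that still gives a "reasonable" result when the two peaks are not so clearly separated. Note that the struck-through comment does not work: it is the histogram that has zero values, not the data, oops!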

Answer 1

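If you have prior information about your distribution, for example that there are two "groups" of samples and the samples within each group are contiguous, then you can use a naive algorithm: find the largest gap between two neighbouring samples.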
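A minimal sketch of that naive approach, assuming "largest gap" means the widest distance between neighbouring values once the data is sorted. The function name `split_at_largest_gap` is made up for illustration; only `np.sort`, `np.diff` and `np.argmax` are real numpy calls.

```python
import numpy as np

def split_at_largest_gap(values):
    """Split values into (low, high) at the widest gap between sorted neighbours."""
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    if sorted_vals.size < 2:
        raise ValueError("need at least two values to split")
    gaps = np.diff(sorted_vals)        # distance between each pair of neighbours
    cut = int(np.argmax(gaps)) + 1     # index of the first value in the upper group
    return sorted_vals[:cut], sorted_vals[cut:]

low, high = split_at_largest_gap([1.0, 1.2, 0.9, 5.1, 4.8, 5.3, 1.1])
print(low, high)
```

This always returns exactly two groups, and a single outlier far from everything else will win the "largest gap" contest, so it is only reasonable when the two groups really are well separated.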

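But the problem of splitting a population into subsets (clusters) is not trivial, and it is usually solved with machine-learning clustering algorithms: http://scikit-learn.org/stable/modules/clustering.html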
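To give a feel for what a clustering algorithm does here, the following is a hand-rolled one-dimensional two-means (Lloyd-style) sketch in plain numpy. It is a stand-in for a library clusterer, not scikit-learn's implementation; the name `two_means_1d` and the `n_iter` parameter are assumptions made for this example.

```python
import numpy as np

def two_means_1d(values, n_iter=100):
    """Split 1-D values into two clusters by alternating assignment and mean update."""
    x = np.asarray(values, dtype=float)
    centers = np.array([x.min(), x.max()])   # start from the two extremes
    for _ in range(n_iter):
        # True where a value is strictly closer to the upper center
        upper = np.abs(x - centers[1]) < np.abs(x - centers[0])
        if upper.all() or not upper.any():
            break                            # degenerate case, e.g. all values equal
        new_centers = np.array([x[~upper].mean(), x[upper].mean()])
        if np.allclose(new_centers, centers):
            break                            # assignments have stopped changing
        centers = new_centers
    # final assignment using the last centers
    upper = np.abs(x - centers[1]) < np.abs(x - centers[0])
    return np.sort(x[~upper]), np.sort(x[upper])

low, high = two_means_1d([1.0, 1.2, 0.9, 5.1, 4.8, 5.3, 1.1])
print(low, high)
```

Unlike the largest-gap rule, this still produces a split when the two peaks overlap, because the boundary lands midway between the two cluster means instead of requiring an empty region in the data.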

Answer 2

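Jean-loup's answer is fine for the question as asked. However, once I had a simple working implementation I was able to think about the problem some more, and I am posting the approach I came up with in case it is useful.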

import numpy as np


def split_population2(seq, n_bins):
    """Split a population into sub-populations.

    Based on binning the data into n_bins and finding contiguous groups
    of non-empty bins.

    Returns [lowest, ..., highest], all of which are sorted sequences.
    """
    sorted_pop = sorted(seq)

    # Bin the data into a 2D structure, one list per bin. np.digitize returns
    # 0 only for values below the first edge, so list 0 always stays empty here.
    _, bins = np.histogram(sorted_pop, bins=n_bins)
    bin_indices = np.digitize(sorted_pop, bins)
    # np.histogram's last bin is closed on the right, but np.digitize puts the
    # maximum into an extra bin of its own; clip so the maximum joins the last bin.
    bin_indices = np.minimum(bin_indices, len(bins) - 1)
    binned = [[] for _ in range(len(bins))]
    for ix_bin, v in zip(bin_indices, sorted_pop):
        binned[ix_bin].append(v)

    # Now join up runs of adjacent non-empty bins.
    joined_bins = [[]]  # 2D, initially holding one empty sub-list
    len_last_bin = 0
    for bin_values in binned:
        len_bin = len(bin_values)
        # An empty bin right after a non-empty one closes the current group.
        # Leading empty bins (such as bin 0) do not open a new group.
        if len_bin == 0 and len_last_bin != 0:
            joined_bins.append([])
        if len_bin != 0:
            joined_bins[-1].extend(bin_values)
        len_last_bin = len_bin

    return joined_bins

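I think this also works when there are more than 2 sub-populations, and it should be relatively robust as long as the sub-populations are clearly separated. The downside is that in some cases the answer depends on the value chosen for n_bins.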
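The dependence on n_bins can be checked without running the full split. The sketch below only counts how many contiguous runs of non-empty histogram bins the data produces for a given bin count, which is the number of groups the binning approach would return. The helper name `count_nonempty_runs` and the synthetic two-peak data are assumptions for illustration.

```python
import numpy as np

def count_nonempty_runs(values, n_bins):
    """Count contiguous runs of non-empty bins in a histogram of values."""
    counts, _ = np.histogram(values, bins=n_bins)
    nonempty = counts > 0
    # a run starts at a non-empty bin whose left neighbour is empty (or missing)
    previous = np.concatenate(([False], nonempty[:-1]))
    return int((nonempty & ~previous).sum())

# synthetic data with two peaks, purely for illustration
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(8.0, 1.0, 300)])
for n_bins in (5, 20, 100, 1000):
    print(n_bins, count_nonempty_runs(data, n_bins))
```

With too few bins the gap between the peaks may fall inside a single bin and disappear, and with too many bins the sparse tails of each peak start to contain empty bins, so the count grows well past two.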

[Figures: original histogram; histogram showing sub-populations]

Splitting a population into two in Python
