Question

Quick question

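I have a series of data points defined in two lists, X and Y. I'm looking for an efficient algorithm to select (say) 10 points from X and Y whose Y values are not only above a specific value (a threshold) but are also spread as widely as possible across the values of X. By 'spread', I mean maximizing the difference in X between adjacent points.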

For example:

If the Y threshold = 100 and the X range = 1-10, an ideal set of values would be:

[1,104]
[2.5,120]
[3,101]
[4.7,150]
[5.2,190]
[6.3,115]
etc

A non-ideal set would be:

[1,104]
[1.3,157]
[1.6,174]
[1.5,120]
[1.17,135]
Etc

Any ideas would be greatly appreciated.

Answer 1

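First extract the data points such that Y > Threshold.

Then sort them by increasing X. Take Xmin and Xmax and compute 8 other equally spaced X values within this range, forming an increasing list Z (the "ideal" values).

Now scan the two lists in parallel, as in a merge operation. Each time you advance in the Z list, keep the corresponding X element.

Note: this procedure can fail if the matching between Z values and X values is not one-to-one (for example, when two different Z values land on the same X). Fixing this is not so obvious.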
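The steps above can be sketched in Python roughly as follows. This is only one possible reading of the answer, not the author's code: the function name `pick_spread_points` and its parameters are made up for illustration, and the rule that reserves enough candidates for the remaining targets is an assumption added here as one way around the failure case mentioned in the note.

```python
def pick_spread_points(points, y_threshold, count):
    """Pick `count` points with y > y_threshold whose x values track an even grid.

    Hypothetical helper: returns None when there are too few candidates.
    """
    # Step 1: keep points above the threshold, sorted by increasing x.
    candidates = sorted((p for p in points if p[1] > y_threshold),
                        key=lambda p: p[0])
    if count < 2 or len(candidates) < count:
        return None

    # Step 2: build the list Z of equally spaced "ideal" x values.
    x_min, x_max = candidates[0][0], candidates[-1][0]
    step = (x_max - x_min) / (count - 1)
    targets = [x_min + k * step for k in range(count)]

    # Step 3: merge-like scan; for each target keep the nearest unused candidate.
    chosen = []
    i = 0
    for k, z in enumerate(targets):
        # Assumption: never move so far that later targets run out of candidates.
        last_allowed = len(candidates) - (count - k)
        while (i < last_allowed and
               abs(candidates[i + 1][0] - z) <= abs(candidates[i][0] - z)):
            i += 1
        chosen.append(candidates[i])
        i += 1  # each candidate is used at most once
    return chosen


sample = [(1, 104), (1.3, 157), (1.6, 174), (1.5, 120), (1.17, 135),
          (2.5, 120), (3, 101), (4.7, 150), (5.2, 190), (6.3, 115), (2, 23)]
print(pick_spread_points(sample, 100, 5))
```

Because every candidate is consumed at most once, the same X is never returned for two Z values, at the cost of sometimes picking a point that is not the nearest one to its target.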

Answer 2

Similar to @Yves Daoust's solution, I wrote a script:

from itertools import combinations

def get_max_x_scatter(datapoints, y_threshold, no_of_points):
    # First exclude the data points that are not above y_threshold
    candidates = [p for p in datapoints if p[1] > y_threshold]
    if len(candidates) < no_of_points:
        print("Not enough data points")
        return None
    # Sort the candidate data points by x (a key function avoids the
    # truncation that an int-based comparator causes for gaps below 1)
    candidates_sorted_by_x = sorted(candidates, key=lambda p: p[0])
    # Get the x distance between the two data points at the far ends
    distance = candidates_sorted_by_x[-1][0] - candidates_sorted_by_x[0][0]
    # Divide by the number of gaps to get the expected average delta
    avg = distance / (no_of_points - 1)
    # Among the K candidates, find the n that are *most scattered*

    min_delta = distance * no_of_points  # make sure the initial min is large enough
    result = None
    for combination in combinations(candidates_sorted_by_x, no_of_points):
        delta = 0.0
        for i in range(1, no_of_points):
            gap = combination[i][0] - combination[i-1][0]
            delta += abs(gap - avg)
        if delta < min_delta:
            min_delta = delta
            result = combination
    return result


dp = [
[1,104],
[1.3,157],
[1.6,174],
[1.5,120],
[1.17,135],
[2.5,120],
[3,101],
[4.7,150],
[5.2,190],
[6.3,115],
[2,23]]

print(get_max_x_scatter(dp, 100, 5))

>>> ([1, 104], [1.6, 174], [3, 101], [4.7, 150], [6.3, 115])

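This algorithm minimizes the deviation of the gaps between adjacent data points from the average delta. It may or may not be what you want, but it can be described as 'as scattered as possible'.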
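To make the quantity being minimized concrete, here is a small sketch that scores a selection the same way the script does. The helper name `gap_deviation` is hypothetical; the average gap is taken from the full candidate range (1 to 6.3) with 5 points, as in the example run above. The last line shows how many subsets the exhaustive search has to examine for 10 candidates and 5 picks (`math.comb` needs Python 3.8 or later).

```python
from math import comb

def gap_deviation(xs, avg_gap):
    """Sum of |gap - avg_gap| over adjacent x values; xs must be sorted."""
    return sum(abs((b - a) - avg_gap) for a, b in zip(xs, xs[1:]))

avg_gap = (6.3 - 1) / (5 - 1)  # expected gap for 5 points spanning 1..6.3

# Score of the selection returned in the example run (about 1.45).
print(gap_deviation([1, 1.6, 3, 4.7, 6.3], avg_gap))
# Score of a tightly clustered selection (about 4.7, so much worse).
print(gap_deviation([1, 1.17, 1.3, 1.5, 1.6], avg_gap))
# Number of subsets the brute-force loop visits for 10 candidates, 5 picks.
print(comb(10, 5))
```

The subset count grows quickly with the number of candidates, so the exhaustive approach is practical mainly for small inputs.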

Answer 3

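I think you should define more precisely what "maximizing the spread across values of X" means. In any case, let's assume you have a function f(S) that, for a set of points S, returns the spread across values of X of that set. You can then try the following greedy algorithm (pseudocode below), which simply selects the potential values one by one.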

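I assume that your initial set of values is (X_i, Y_i) for 1 <= i <= n.

    Let S = empty list
    Let i = 1
    While |S| < 10
        If Y_i > threshold
            add (X_i, Y_i) to S
        i++
    While i <= n
        If Y_i > threshold
            Let j be such that X_j <= X_i <= X_{j+1}
            Let S_j = S, S_{j+1} = S
            Remove (X_j, Y_j) from S_j and (X_{j+1}, Y_{j+1}) from S_{j+1}
            Add (X_i, Y_i) to S_j and to S_{j+1}
            If f(S_j) > f(S) let S = S_j
            If f(S_{j+1}) > f(S) let S = S_{j+1}
        i++
    Return S

This brings the complexity down to something like p*n, where p is the number of values you want to select (assuming that you keep S sorted and have a relatively fast way of computing f). However, I am not sure whether this greedy algorithm yields an optimal solution. My guess is that it works at least for some reasonable forms of f.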
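A minimal Python sketch of this greedy idea is given below. The answer leaves f unspecified, so `min_gap` (the smallest gap between adjacent X values) is purely an assumed stand-in, and the names `greedy_spread` and `min_gap` are hypothetical. The sketch takes j and j+1 to be the neighbours of the new point within the sorted selection S.

```python
from bisect import bisect_left

def min_gap(points):
    """Assumed spread measure f(S): smallest gap between adjacent x values."""
    xs = [p[0] for p in points]
    return min(b - a for a, b in zip(xs, xs[1:]))

def greedy_spread(points, y_threshold, count, f=min_gap):
    """Greedy selection; `selected` is kept sorted by x throughout."""
    qualifying = [p for p in points if p[1] > y_threshold]
    if count < 2 or len(qualifying) < count:
        return None
    # Seed S with the first `count` qualifying points.
    selected = sorted(qualifying[:count], key=lambda p: p[0])
    for point in qualifying[count:]:
        # Locate the neighbours of the new point inside the current selection.
        pos = bisect_left([p[0] for p in selected], point[0])
        best = selected
        for j in (pos - 1, pos):
            if 0 <= j < count:
                # Swapping a neighbour for the new point keeps the list sorted.
                trial = selected[:j] + [point] + selected[j + 1:]
                if f(trial) > f(best):
                    best = trial
        selected = best
    return selected


demo = [(0, 110), (10, 120), (1, 130), (5, 140), (2, 150), (7, 90)]
print(greedy_spread(demo, 100, 3))
```

Because only the two neighbours of each new point are considered for replacement, the result depends on the order in which the points arrive, which is consistent with the caveat that the greedy approach is not guaranteed to be optimal.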

Find the top 10 values above a Y threshold and spread across X
