
Question

I have a sorted array of numbers:

arr = [-0.1, 0.0, 0.5, 0.8, 1.2]

I want the difference between consecutive numbers of that array (dist below) to be above a given threshold. For example, if the threshold is 0.25:

dist = [0.1, 0.5, 0.3, 0.4] # must be >0.25 for all elements

arr[0] and arr[1] are too close to each other, so one of them must be modified. In this case, the desired array would be:

good_array = [-0.25, 0.0, 0.5, 0.8, 1.2] # all elements distance > threshold

To obtain good_array, I want to modify the minimum number of elements in arr. So I subtract 0.15 from arr[0] rather than, say, subtracting 0.1 from arr[0] and adding 0.05 to arr[1]:

[-0.2, 0.05, 0.5, 0.8, 1.2]

The previous array is also valid, but we have modified 2 elements rather than one.

Also, if good_array can be generated by modifying different elements in arr, by default modify the element closer to the edge of the array. But keep in mind that the main goal is to generate good_array by modifying the minimum number of elements in arr.

[-0.1, 0.15, 0.5, 0.8, 1.2]

The previous array is also valid, but we have modified arr[1] rather than the element closer to the edge (arr[0]). If 2 elements are at an equal distance from the edges, modify the one closer to the beginning of the array.

So far I have been doing this manually for small arrays, but I would like a general solution for larger arrays.
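To make the requirement concrete, here is a minimal sketch of the two quantities the question cares about: whether every consecutive gap clears the threshold, and how many elements a candidate changes. The helper names (gaps, is_valid, modified_count) are hypothetical, and the check assumes that a gap equal to the threshold is acceptable, since good_array in the question has a gap of exactly 0.25.

```python
import math


def gaps(values):
    """Differences between consecutive elements (the question's dist)."""
    return [b - a for a, b in zip(values, values[1:])]


def is_valid(values, threshold, tol=1e-9):
    # Assumption: a gap equal to the threshold passes, as in good_array.
    # tol absorbs floating-point rounding in the subtractions.
    return all(g >= threshold - tol for g in gaps(values))


def modified_count(original, candidate, tol=1e-9):
    """Number of positions where candidate differs from original."""
    return sum(
        1
        for a, b in zip(original, candidate)
        if not math.isclose(a, b, abs_tol=tol)
    )


arr = [-0.1, 0.0, 0.5, 0.8, 1.2]
good_array = [-0.25, 0.0, 0.5, 0.8, 1.2]

print(is_valid(arr, 0.25), is_valid(good_array, 0.25))
print(modified_count(arr, good_array))
```

Any proposed solver can be scored with these two helpers: first reject candidates for which is_valid is false, then prefer the one with the smallest modified_count.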

Answer 1

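Edit: I just realized that my original solution was silly and overcomplicated. Here is a simpler and better one.

First approach

If I understand your problem correctly, your input array can contain regions where your condition is not met. For example:

array = [0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.5, 0.75, 1.0] (the first 4 elements)

or:

array = [0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.25, 1.5, 1.75] (elements arr[4], arr[5] and arr[6])

To fix this, you have to add (or subtract) some pattern, such as:

fixup = [0.0, 0.25, 0.0, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0] (for the first case)

or:

fixup = [0.0, 0.0, 0.0, 0.0, 0.25, 0.0, 0.25, 0.0, 0.0, 0.0, 0.0] (for the second example)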
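As a quick check of the two patterns above, the following sketch adds each fixup to its array and reports the smallest absolute gap before and after. The helper names apply_fixup and min_abs_gap are hypothetical. Note that the patched arrays are no longer sorted, so the sketch assumes the condition is read as an absolute difference between neighbours.

```python
def apply_fixup(array, fixup):
    """Element-wise sum of the array and its correction pattern."""
    return [a + f for a, f in zip(array, fixup)]


def min_abs_gap(values):
    """Smallest absolute difference between neighbouring elements."""
    return min(abs(b - a) for a, b in zip(values, values[1:]))


# First case from the answer.
array = [0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.5, 0.75, 1.0]
fixup = [0.0, 0.25, 0.0, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
patched = apply_fixup(array, fixup)
print(patched)  # [0.0, 0.25, 0.0, 0.25, 0.0, 0.25, 0.5, 0.75, 1.0]
print(min_abs_gap(array), min_abs_gap(patched))  # 0.0 0.25

# Second case from the answer.
array2 = [0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.25, 1.5, 1.75]
fixup2 = [0.0, 0.0, 0.0, 0.0, 0.25, 0.0, 0.25, 0.0, 0.0, 0.0, 0.0]
patched2 = apply_fixup(array2, fixup2)
print(min_abs_gap(array2), min_abs_gap(patched2))  # 0.0 0.25
```

In both cases only two elements are touched, and every neighbouring pair ends up exactly 0.25 apart or more.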

Second approach

However, the current solution has a problem. Consider a bad region with an "elevation":

array = [0.0, 0.25, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.35, 1.6] (the bad region is in the 0.6-1.0 range)

In this case, the correct "solution" would be:

fixup = [0.0, 0.0, 0.0, 0.25+0.1, 0.0, 0.25+0.1, 0.0, 0.25+0.1, 0.0, 0.0, 0.0]

which yields:

good_array = [0.0, 0.25, 0.5, 0.95, 0.7, 1.15, 0.9, 1.35, 1.1, 1.35, 1.6]

To sum up, you have to apply a "patch":

fixup[i] = threshold+max(difference[i], difference[i-1]) (for i where i-start_index is even)

(note that for negative values it will be -threshold+min(difference[i], difference[i-1]))

and:

fixup[i] = 0 (for i where i-start_index is odd),

where start_index is the beginning of the bad region.
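The patch rule can be sketched as follows. The answer does not define difference[i], so this sketch assumes difference[i] = array[i+1] - array[i]. The function name build_fixup and the end_index parameter are hypothetical, and the bad region is assumed to lie strictly inside the array so that both neighbours exist.

```python
def build_fixup(array, threshold, start_index, end_index):
    """Patch every second element of array[start_index..end_index].

    Assumptions (not stated in the answer): difference[i] is
    array[i + 1] - array[i], and 1 <= start_index, end_index <= len - 2.
    """
    fixup = [0.0] * len(array)
    for i in range(start_index, end_index + 1, 2):
        prev_gap = array[i] - array[i - 1]   # difference[i-1]
        next_gap = array[i + 1] - array[i]   # difference[i]
        fixup[i] = threshold + max(next_gap, prev_gap)
    return fixup


array = [0.0, 0.25, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.35, 1.6]
threshold = 0.25

# Bad region 0.6-1.0 corresponds to indices 3..7.
fixup = build_fixup(array, threshold, start_index=3, end_index=7)
patched = [a + f for a, f in zip(array, fixup)]

print([round(f, 2) for f in fixup])
print([round(p, 2) for p in patched])
```

Only indices 3, 5 and 7 receive a non-zero correction of about 0.35, which matches the fixup pattern written out for this example.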

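Third approach

The previously mentioned formula does not work well in some cases (for example [0.1, 0.3, 0.4], where it raises 0.3 to 0.75 although 0.65 would be enough).

Let's try to improve on that:

good_array[i] = max(threshold+array[i-1], threshold+array[i+1]) (for abs(array[i-1]-array[i+1]) < threshold*2)

and:

good_array[i] = (array[i-1]+array[i+1])/2 otherwise.

(If minimizing the differences is also one of your optimization goals, you can instead choose the formula good_array[i] = min(-threshold+array[i-1], -threshold+array[i+1]) when it brings the result closer to the original array value.)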
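A small sketch of the improved rule, compared against the earlier patch formula on the [0.1, 0.3, 0.4] example. The function names are hypothetical, and the earlier formula is evaluated under the assumption that difference[i] = array[i+1] - array[i].

```python
def repaired_value(array, i, threshold):
    """New value for array[i] derived only from its two neighbours."""
    left, right = array[i - 1], array[i + 1]
    if abs(left - right) < 2 * threshold:
        # Neighbours are too close to fit array[i] between them,
        # so place it one threshold above the larger neighbour.
        return max(threshold + left, threshold + right)
    # Enough room between the neighbours: use the midpoint.
    return (left + right) / 2


def earlier_patch_value(array, i, threshold):
    """Value produced by the second approach's formula (assumed reading)."""
    prev_gap = array[i] - array[i - 1]
    next_gap = array[i + 1] - array[i]
    return array[i] + threshold + max(next_gap, prev_gap)


array = [0.1, 0.3, 0.4]
print(round(earlier_patch_value(array, 1, 0.25), 2))  # about 0.75
print(round(repaired_value(array, 1, 0.25), 2))       # about 0.65

# Neighbours at least 2 * threshold apart: the midpoint is used.
print(repaired_value([0.0, 0.9, 1.0], 1, 0.25))
```

The new rule ignores the current value of array[i] entirely, which is why it avoids the overshoot of the earlier formula.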

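Fourth approach

Bad regions of even length are also a threat. I can think of a few ways to solve this:

a solution based on a pattern like [0.0, 0.25, 0.5, 0.0]
or one based on a pattern like [0.0, 0.25, -0.25, 0.0] (we just use the "second formula")
or [0.0, 0.25, 0.0, 0.25] (just include one additional element to make the bad region length odd; I don't recommend this approach because it requires handling a lot of corner cases)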
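The sketch below tries the three patterns on a made-up array whose bad region has four elements (indices 2 to 5). The array, the helper names and the way each pattern is aligned with the region are assumptions for illustration, one possible reading of the patterns rather than the author's exact procedure.

```python
def apply_pattern(array, start, pattern):
    """Add pattern to array[start:start + len(pattern)], leave the rest."""
    patched = list(array)
    for offset, delta in enumerate(pattern):
        patched[start + offset] += delta
    return patched


def min_abs_gap(values):
    """Smallest absolute difference between neighbouring elements."""
    return min(abs(b - a) for a, b in zip(values, values[1:]))


# Hypothetical input: four equal values form an even-length bad region.
array = [0.5, 0.75, 1.0, 1.0, 1.0, 1.0, 1.25, 1.5]

patterns = {
    "rising": [0.0, 0.25, 0.5, 0.0],
    "up_down": [0.0, 0.25, -0.25, 0.0],
    "alternating": [0.0, 0.25, 0.0, 0.25],
}

results = {}
for name, pattern in patterns.items():
    patched = apply_pattern(array, 2, pattern)
    results[name] = min_abs_gap(patched)
    print(name, patched, results[name])
```

On this input the first two patterns leave every gap at 0.25 or more, while the alternating pattern collides with the element right after the region (1.25 next to 1.25). That is consistent with the remark that the third option needs an extra element and more corner-case handling.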

Corner cases

Also consider some corner cases (a bad region that starts or ends at the "edge" of the array):

good_array[0] = threshold+array[1]

and:

good_array[array_size-1] = threshold+array[array_size-2]
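The two edge formulas can be sketched like this. The function name fix_edges is hypothetical, and the sketch assumes the array has at least two elements and that closeness is measured as an absolute difference.

```python
def fix_edges(array, threshold):
    """Apply the two edge formulas when an edge element is too close."""
    fixed = list(array)
    if abs(fixed[1] - fixed[0]) < threshold:
        # good_array[0] = threshold + array[1]
        fixed[0] = threshold + fixed[1]
    if abs(fixed[-1] - fixed[-2]) < threshold:
        # good_array[array_size-1] = threshold + array[array_size-2]
        fixed[-1] = threshold + fixed[-2]
    return fixed


print(fix_edges([-0.1, 0.0, 0.5, 0.8, 1.2], 0.25))  # [0.25, 0.0, 0.5, 0.8, 1.2]
print(fix_edges([0.0, 0.5, 1.0, 1.1], 0.25))        # [0.0, 0.5, 1.0, 1.25]
```

Note that the first formula moves the first element above its neighbour, so the result is not sorted. If sorted output matters, as in the question's good_array where arr[0] becomes -0.25, subtracting the threshold from array[1] is the variant that preserves order.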

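Final tip

I suggest writing plenty of unit tests alongside the implementation, so that you can easily verify the correctness of the derived formulas and cover combinations of corner cases. A bad region containing only one element may be one of them.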
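As an illustration of that tip, here is a minimal unittest sketch. The validity check, the repair helper (taken from the third approach's formulas) and the test arrays other than the question's are assumptions chosen for the example, not a complete test suite.

```python
import unittest


def is_valid(values, threshold, tol=1e-9):
    """True when every neighbouring pair is at least threshold apart."""
    return all(
        abs(b - a) >= threshold - tol for a, b in zip(values, values[1:])
    )


def repaired_value(array, i, threshold):
    """Third-approach rule: new value for array[i] from its neighbours."""
    left, right = array[i - 1], array[i + 1]
    if abs(left - right) < 2 * threshold:
        return max(threshold + left, threshold + right)
    return (left + right) / 2


class SpacingTests(unittest.TestCase):
    def test_question_array_is_rejected(self):
        self.assertFalse(is_valid([-0.1, 0.0, 0.5, 0.8, 1.2], 0.25))

    def test_good_array_is_accepted(self):
        self.assertTrue(is_valid([-0.25, 0.0, 0.5, 0.8, 1.2], 0.25))

    def test_single_element_bad_region(self):
        # Only index 2 is too close to a neighbour (hypothetical input).
        array = [0.0, 0.25, 0.3, 0.55]
        self.assertFalse(is_valid(array, 0.25))
        array[2] = repaired_value(array, 2, 0.25)
        self.assertTrue(is_valid(array, 0.25))


# Run the suite programmatically so the snippet also works when pasted
# into an interactive session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SpacingTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each new formula or corner case can get its own test method that builds a small array, applies the fix, and asserts is_valid on the result.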

Answer 2

Here is a brute-force Python solution, in which we try fixing elements either to the left or to the right whenever there is a conflict:

def solve(arr, threshold):
    arr = list(arr)          # work on a copy so the caller's list is untouched
    original = list(arr)

    def search(idx):
        # Past the last pair: return [number of modified elements, candidate array].
        if idx + 1 >= len(arr):
            return [sum(1 for x in range(len(arr)) if arr[x] != original[x]), list(arr)]

        if arr[idx + 1] - arr[idx] < threshold:
            copy = list(arr)

            # Option 1: push arr[idx] (and left neighbours, if needed) downwards.
            left_cost = 0
            while idx - left_cost >= 0 and arr[idx + 1] - arr[idx - left_cost] < threshold * (left_cost + 1):
                arr[idx - left_cost] = arr[idx - left_cost + 1] - threshold
                left_cost += 1

            left = search(idx + 1)
            for cost in range(left_cost):
                arr[idx - cost] = copy[idx - cost]  # undo the left shift

            # Option 2: push arr[idx + 1] (and right neighbours, if needed) upwards.
            right_cost = 0
            while idx + right_cost + 1 < len(arr) and arr[idx + right_cost + 1] - arr[idx] < threshold * (right_cost + 1):
                arr[idx + right_cost + 1] = arr[idx + right_cost] + threshold
                right_cost += 1

            right = search(idx + 1)
            for cost in range(right_cost):
                arr[idx + cost + 1] = copy[idx + cost + 1]  # undo the right shift

            # Prefer the option with fewer modified elements; break ties by position.
            if right[0] < left[0]:
                return right
            elif left[0] < right[0]:
                return left
            else:
                return left if idx - left[0] <= len(arr) - idx - right[0] else right

        else:
            return search(idx + 1)

    return search(0)

print(solve([0, 0.26, 0.63, 0.7, 1.2], 0.25))

Find the minimum set of operations to modify an array to satisfy a condition
