Question

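I wrote a simple linear regression and a decision tree classifier with Python's Scikit-learn library to predict an outcome. They work well.

My question is whether there is a way to do this in reverse: to predict the best combination of parameter values based on an imputed outcome (the parameters for which accuracy will be highest).

Put another way: is there a classification, regression or other type of algorithm (decision tree, SVM, KNN, logistic regression, linear regression, polynomial regression...) that can predict multiple outcomes based on one (or more) parameters?

I tried to do this by passing in a multivariate outcome, but it raises an error: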

ValueError: Expected 2D array, got 1D array instead: array=[101 905 182 182 268 646 624 465]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

This is the code I wrote for the regression:

import pandas as pd
from sklearn import linear_model
from sklearn import tree

dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

regression = linear_model.LinearRegression()
regression.fit(variables, results)

input_values = [14, 2]

prediction = regression.predict([input_values])
prediction = round(prediction[0], 2)
print(prediction)

This is the code I wrote for the decision tree:

dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'yes']}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(variables, results)

input_values = [18, 2]

prediction = decision_tree.predict([input_values])[0]
print(prediction)

Answer 1

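You can frame the problem as an optimization problem.

Let the input values of the (trained) regression model be the parameters to search for.

Define the distance between the model's predicted price (for a given combination of inputs) and the desired price (the price you want) as the cost function.

Then use one of the global optimization algorithms (e.g. genetic optimization) to find the combination of inputs that minimizes the cost (i.e. makes the predicted price closest to your desired price).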
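The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the toy data from the question instead of prices, picks scipy's `differential_evolution` as one possible evolutionary global optimizer, and the `target` value of 500 is a made-up desired outcome. Note that several input combinations can reach the same prediction, so the result is one solution, not the only one.

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.linear_model import LinearRegression

# Training data from the question
X = np.array([[10, 1], [30, 3], [13, 1], [19, 2], [25, 3], [33, 3], [23, 2]], dtype=float)
y = np.array([101, 905, 182, 268, 646, 624, 465], dtype=float)
model = LinearRegression().fit(X, y)

target = 500.0  # hypothetical desired outcome (an assumption, not from the question)

def cost(x):
    """Squared distance between the model's prediction at x and the target."""
    return (model.predict(x.reshape(1, -1))[0] - target) ** 2

# Search only inside the range covered by the training data
bounds = list(zip(X.min(axis=0), X.max(axis=0)))
result = differential_evolution(cost, bounds)

best_inputs = result.x
predicted = model.predict(best_inputs.reshape(1, -1))[0]
print(best_inputs, round(predicted, 2))
```

Restricting the search to the observed range of each parameter keeps the optimizer from extrapolating far outside the region where the model was fitted.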

Answer 2

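Considering the real-world example you mentioned, I suggest treating the input as a price range rather than a single price; the features can then be grouped so that they correspond to a particular price range.

So you can start by clustering the dataset on house price. The Mean Shift clustering algorithm will also suggest how many clusters can be formed in the data.

You can then determine the minimum and maximum house price of each cluster, take the mean of the numeric features and the most frequent value of the categorical features (the features used to predict the house price), and say that these values correspond to that price range.

Once the mapping is done, we can see which price-range cluster an input falls into and then obtain the aggregated parameters as described above.

Dataset source: https://github.com/ageron/handson-ml/tree/master/datasets/housing

Code: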

import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift

df = pd.read_csv('housing.csv')
df.drop(['longitude', 'latitude'], axis=1, inplace=True)

# Cluster on the house price alone; scikit-learn expects a 2D array
X_train = np.array(df['median_house_value']).reshape(-1, 1)

# bandwidth=None lets MeanShift estimate the bandwidth itself
ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X_train)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)
print(labels)

df['cluster'] = labels

# The rest assumes that two clusters were found, labelled 0 and 1
df1 = df[df['cluster'] == 1]
df2 = df[df['cluster'] == 0]

# Price range [min, max] covered by each cluster
ranges = []
ranges.append([min(df1['median_house_value']), max(df1['median_house_value'])])
ranges.append([min(df2['median_house_value']), max(df2['median_house_value'])])

# Set the categorical column aside, then average the remaining numeric columns
df1_categorical = 'ocean_proximity'
df1_categorical_set = df1[df1_categorical]
df1 = df1.drop(df1_categorical, axis=1)
df2_categorical_set = df2[df1_categorical]
df2 = df2.drop(df1_categorical, axis=1)

df1_feature = []
for i in df1.columns:
    df1_feature.append(np.mean(df1[i]))

df2_feature = []
for i in df2.columns:
    df2_feature.append(np.mean(df2[i]))

print("Range : ", ranges[0], "\nFeatures : ", df1_feature, '\n', "Range : ", ranges[1], "\nFeatures : ", df2_feature)

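If you now print df1_feature and df2_feature, you get the mean feature values for the two clusters (they are also printed after the corresponding entries of the ranges list). So for any house whose price falls in the first range, df1_feature is the ideal feature set, and likewise df2_feature for the second range.

If you want more price ranges, you can cluster with k-means instead and specify the number of clusters.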
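A minimal sketch of the k-means variant, under assumptions: since housing.csv may not be at hand, it uses a small synthetic table with hypothetical `price` and `rooms` columns, and the choice of three clusters is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the housing data (hypothetical columns and values)
rng = np.random.default_rng(0)
prices = np.concatenate([
    rng.normal(100_000, 5_000, 50),
    rng.normal(250_000, 5_000, 50),
    rng.normal(450_000, 5_000, 50),
])
rooms = prices / 50_000 + rng.normal(0, 0.2, prices.size)
df = pd.DataFrame({'price': prices, 'rooms': rooms})

# Cluster on price only, with an explicitly chosen number of price ranges
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df['cluster'] = kmeans.fit_predict(df[['price']])

# Price range and mean feature value for each cluster
summary = df.groupby('cluster').agg(
    price_min=('price', 'min'),
    price_max=('price', 'max'),
    rooms_mean=('rooms', 'mean'),
)
print(summary)
```

Each row of `summary` plays the role of one entry of `ranges` together with its averaged features, without hard-coding the number of clusters in the rest of the code.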

Answer 3

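@taga, I think you mean multivariate regression. I have used partial least squares (PLS) for this: given a set of N features, you build a model that estimates M outputs, ending up with an N×M matrix. Does that sound like what you are looking for? I can elaborate further.

Edit:

Using the same code you provided, it would look something like this: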

import pandas as pd
from sklearn import linear_model
from sklearn import tree

dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome1': [101, 905, 182, 268, 646, 624, 465],
       'outcome2': [105, 320, 135, 208, 262, 324, 246]
}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-2]
results = df.iloc[:,-2:]

regression = linear_model.LinearRegression()
regression.fit(variables, results)

input_values = [14, 2]

prediction = regression.predict([input_values])
prediction = [round(x,2) for x in prediction[0]]
print(prediction)

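You need to pass the results to the model's fit function as an L×M array, where L is the number of samples and M is the number of outcomes.

Hope that helps.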

Answer 4

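As @Justas mentioned, if you want to find the combination of input values for which the output variable is at its maximum or minimum, this is an optimization problem.

scipy offers quite a few nonlinear optimizers, or you can go for metaheuristics such as genetic algorithms, memetic algorithms, etc.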
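As a rough sketch of the max/min case, assuming the question's toy data and using `scipy.optimize.minimize` with the L-BFGS-B method as one of the available bounded optimizers: maximizing the prediction is done by minimizing its negative. The bounds matter here, because a linear model has no maximum without them.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression

# Training data from the question
X = np.array([[10, 1], [30, 3], [13, 1], [19, 2], [25, 3], [33, 3], [23, 2]], dtype=float)
y = np.array([101, 905, 182, 268, 646, 624, 465], dtype=float)
model = LinearRegression().fit(X, y)

def negative_prediction(x):
    """Negated model output, so that minimizing it maximizes the prediction."""
    return -model.predict(x.reshape(1, -1))[0]

# Keep the search inside the observed range of each parameter
bounds = list(zip(X.min(axis=0), X.max(axis=0)))
x0 = X.mean(axis=0)  # start from the centre of the training data

result = minimize(negative_prediction, x0, method="L-BFGS-B", bounds=bounds)
best_inputs = result.x
best_prediction = -result.fun
print(best_inputs, round(best_prediction, 2))
```

For a nonlinear model with several local optima, a local method like this one may get stuck, which is where the metaheuristics mentioned above come in.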

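On the other hand, if your aim is to learn the inverse function, which maps the output variable to a set of input variables, then go for MultiOutputRegressor or MultiOutputClassifier. Both can be used as a wrapper around any base estimator such as LinearRegression, LogisticRegression, KNN, DecisionTree, SVM, etc.

Example: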

import pandas as pd
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.linear_model import LinearRegression


dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

multi_output_reg = MultiOutputRegressor(LinearRegression())
multi_output_reg.fit(results.values.reshape(-1, 1),variables)

multi_output_reg.predict([[100]])

# array([[12.43124217,  1.12571947]])
# sounds sensible according to the training data

# If the input variables need to be treated as categories,
# go for MultiOutputClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs'))
multi_output_clf.fit(results.values.reshape(-1, 1),variables)

multi_output_clf.predict([[100]])

# array([[10,  1]])

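In most cases, knowing the value of one of the input variables helps in predicting the others. This approach can be implemented with ClassifierChain or RegressorChain.

To understand the advantage of ClassifierChain, see this example.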
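A minimal sketch of the chained variant with `RegressorChain`, assuming the same toy data and the same inverse setup as above (outcome as the single feature, par_1 and par_2 as targets). The chain order `[0, 1]` is a choice made for this illustration: par_1 is predicted first and then fed in as an extra feature when predicting par_2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import RegressorChain

# Inverse setup: the outcome is the feature, the parameters are the targets
outcome = np.array([101, 905, 182, 268, 646, 624, 465], dtype=float).reshape(-1, 1)
params = np.array([[10, 1], [30, 3], [13, 1], [19, 2], [25, 3], [33, 3], [23, 2]], dtype=float)

# order=[0, 1]: fit par_1 from the outcome, then par_2 from the outcome and par_1
chain = RegressorChain(LinearRegression(), order=[0, 1])
chain.fit(outcome, params)

prediction = chain.predict([[100.0]])
print(prediction)
```

The first target in the chain is fitted exactly as in the MultiOutputRegressor example; only the later targets benefit from the earlier predictions.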

Update:


dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': [0, 1, 1, 1, 1, 1 , 0]}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs',
                                                            multi_class='ovr'))
multi_output_clf.fit(results.values.reshape(-1, 1),variables)

multi_output_clf.predict([[1]])
# array([[13,  3]])

Answer 5

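If I understand the question, I think a basic neural network would do the job. When you ask whether an algorithm "can predict multiple outcomes based on one (or more) parameters", note that you can feed a neural network as many parameters as you like and have it output as many different outcomes as you like. If you decide you want a binary decision (i.e. yes or no) for your problem, a basic perceptron will also work. Both approaches let you use an input vector as long as you like.

I hope I understood your question correctly and that this is a useful way to approach it!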
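A small sketch of what this could look like with scikit-learn's `MLPRegressor`, which accepts a 2D target and therefore predicts several outcomes at once. Everything specific here is an assumption for illustration: the second outcome column is borrowed from Answer 3, the hidden layer size is arbitrary, and seven samples are far too few to train a meaningful network.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two input parameters per sample
X = np.array([[10, 1], [30, 3], [13, 1], [19, 2], [25, 3], [33, 3], [23, 2]], dtype=float)
# Two outcomes per sample (second column taken from Answer 3's example)
Y = np.array([[101, 105], [905, 320], [182, 135], [268, 208],
              [646, 262], [624, 324], [465, 246]], dtype=float)

# Scale the inputs, then fit one small hidden layer (sizes are arbitrary)
net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(8,), solver="lbfgs", max_iter=2000, random_state=0),
)
net.fit(X, Y)

# One input vector in, one vector with two predicted outcomes out
prediction = net.predict([[14.0, 2.0]])
print(prediction)
```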

Answer 6

For the regression, you can extract the coefficients and determine which inputs produce the largest output. It looks like this:

import numpy as np

# `regression` and `dic` come from the regression code in the question
# Extract the linear regression's coefficients
coeff = regression.coef_
input_values = list(zip(dic['par_1'], dic['par_2']))
# Choose the training input that maximizes the linear combination
# (the intercept is the same for every input, so it can be ignored)
index_best_input = np.argmax([x[0]*coeff[0] + x[1]*coeff[1] for x in input_values])

best_input = input_values[index_best_input]

print(best_input)
# (33, 3)

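For your decision tree, the best approach is to look at each leaf and check its precision, while taking into account the number of training entries in each leaf. What you can do is print the tree: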

import pandas as pd
import graphviz
from sklearn import tree
dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'yes']}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(variables, results)

dot_data = tree.export_graphviz(decision_tree, out_file=None) 
graph = graphviz.Source(dot_data)  
print(graph)

You can see that there are four good candidates with 100% precision, but each is based on only a few samples:

Inputs with par_1 > 31.5
Inputs with par_1 < 11.5
Inputs with 16 < …
Inputs with 16 < …
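If graphviz is not installed, the same inspection can be sketched in plain text. This is an illustration under assumptions, not the answer's original method: it uses `tree.export_text` to print the split rules and `apply` to find the leaf of each training sample, then counts samples and purity per leaf with pandas. `random_state=0` is only there to make the tree reproducible.

```python
import pandas as pd
from sklearn import tree

df = pd.DataFrame({'par_1': [10, 30, 13, 19, 25, 33, 23],
                   'par_2': [1, 3, 1, 2, 3, 3, 2],
                   'outcome': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'yes']})
X = df[['par_1', 'par_2']]
y = df['outcome']

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Text rendering of the split rules
print(tree.export_text(clf, feature_names=list(X.columns)))

def majority(s):
    """Most frequent class among the samples of a leaf."""
    return s.mode()[0]

def purity(s):
    """Share of the leaf's samples that belong to its majority class."""
    return s.value_counts(normalize=True).iloc[0]

# Leaf reached by each training sample, summarized per leaf
leaves = pd.DataFrame({'leaf': clf.apply(X), 'outcome': y})
summary = leaves.groupby('leaf')['outcome'].agg(
    n_samples='size', majority=majority, purity=purity)
print(summary)
```

Leaves with high purity but very few samples should be read with caution, as noted above.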

Multi-output regression or classification with one (or more) parameters in Python
