How can I make sure I am getting correct results from this regression generator?

Time: 2017-04-19 04:46:30

Tags: python machine-learning scikit-learn regression precision-recall

I wrote a simple script that generates random sample data and fits a regression to it:

import matplotlib.pyplot as plt
import numpy as np
import random
import sklearn.datasets
import sklearn.linear_model as lm
##########################################
n = np.random.randint(1,10)
b = np.random.randint(50,200)
X1_, Y1_ = sklearn.datasets.make_regression(n_samples=100, n_features=1, noise=n, bias=b)
X1 = X1_.reshape(len(X1_), 1)
Y1 = Y1_.reshape(len(Y1_), 1)
##########################################
x = np.array(X1)
y = np.array(Y1)
##########################################
lr = lm.LinearRegression()
lr.fit(x, y)
td = np.arange(1, 101, 1).reshape(100, 1)
n_y = lr.predict(td)
##########################################
f, ax = plt.subplots(1, 2, sharey=True)
ax[0].scatter(x, y)
ax[0].set_xlim([-4, 4])
ax[0].set_title("x, y")
ax[1].plot(x, n_y, 'g')
ax[1].set_xlim([-4, 4])
ax[1].set_title("x_tr, y_lr")
f.suptitle("Regression")
plt.ylim(y.min()-1, y.max()+1)
##########################################
print ("Array:   {}\nType:   {}\nShape:   {}\nLength:   {}\nData:   {}\n".format("X1",  type(X1),  str(np.shape(X1)),  len(X1),   str(X1)))
print ("Array:   {}\nType:   {}\nShape:   {}\nLength:   {}\nData:   {}\n".format("Y1",  type(Y1),  str(np.shape(Y1)),  len(Y1),   str(Y1)))
print ("Array:   {}\nType:   {}\nShape:   {}\nLength:   {}\nData:   {}\n".format("x",   type(x),   str(np.shape(x)),   len(x),    str(x)))
print ("Array:   {}\nType:   {}\nShape:   {}\nLength:   {}\nData:   {}\n".format("y",   type(y),   str(np.shape(y)),   len(y),    str(y)))
print ("Array:   {}\nType:   {}\nShape:   {}\nLength:   {}\nData:   {}\n".format("td",  type(td),  str(np.shape(td)),  len(td),   str(td)))
print ("Array:   {}\nType:   {}\nShape:   {}\nLength:   {}\nData:   {}\n".format("n_y", type(n_y), str(np.shape(n_y)), len(n_y),  str(n_y)))
##########################################
plt.show()
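Although it appears to run without errors, I am still concerned about accuracy: the regression line always comes out at random angles and with odd shapes. How can I test this? Which error-reporting functions should I be looking at?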

1 Answer:

Answer 0: (Score: 0)

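What you are observing happens because your data is random. Regression is essentially an attempt to recover the distribution that generated the data, so it is somewhat ironic that you are trying to recover the distribution of a random generator, which is in effect trying to hide that distribution.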
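For generated data, one way to check the fit is to compare it with the parameters the generator actually used. The following is only a minimal sketch under stated assumptions: make_regression is called with coef=True so that the true slope is returned, and the fixed noise, bias, and random_state values, as well as the variable names, are illustrative choices that do not appear in the original script. It also reports RMSE and R² from sklearn.metrics, and it evaluates the fitted line on a grid spanning the range of x, so that each prediction is paired with the input it was computed from (the question's script predicts on td, the integers 1 to 100, but plots those values against x).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative, fixed generator settings (assumptions, not from the question).
noise = 5.0
bias = 100.0

# coef=True also returns the true slope used to generate y.
X, y, true_coef = make_regression(
    n_samples=100, n_features=1, noise=noise, bias=bias,
    coef=True, random_state=0,
)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# How far the fitted parameters are from the generator's parameters.
slope_error = abs(model.coef_[0] - float(true_coef))
intercept_error = abs(model.intercept_ - bias)

# Standard error metrics on the training data.
rmse = np.sqrt(mean_squared_error(y, y_pred))
r2 = r2_score(y, y_pred)

# Predictions on a sorted grid inside the data range, for plotting
# as plot(grid, grid_pred) rather than against unrelated x values.
grid = np.linspace(X.min(), X.max(), 50).reshape(-1, 1)
grid_pred = model.predict(grid)

print("true slope:     ", float(true_coef))
print("fitted slope:   ", model.coef_[0])
print("fitted intercept:", model.intercept_)
print("RMSE:", rmse, " R^2:", r2)
```

If the fit is sound, the fitted slope and intercept should land close to the generator's values, and the RMSE should be close to the noise level passed to make_regression.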

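If you want to test a regression method, you should use one of the common ML datasets available online. For example, the UCI ML dataset collection (filter for regression tasks): http://archive.ics.uci.edu/ml/datasets.html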
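As a rough illustration of testing on a standard dataset, the sketch below uses the diabetes dataset bundled with scikit-learn (load_diabetes) instead of a UCI download. That substitution, the 25% hold-out split, and the random_state value are assumptions made for the example, not something the answer prescribes. The model is scored on data it was not fitted on.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Bundled regression dataset: 442 samples, 10 features.
X, y = load_diabetes(return_X_y=True)

# Hold out a quarter of the data for evaluation (illustrative choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Error metrics computed on the held-out samples only.
test_mse = mean_squared_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

print("test MSE:", test_mse)
print("test R^2:", test_r2)
```

Scoring on a held-out split gives a more honest picture than scoring on the training data, because it measures how the fitted line behaves on samples it has not seen.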