Question

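I implemented a simple linear regression using scikit-learn and TensorFlow.

My scikit-learn solution seems fine, but with TensorFlow my evaluation output shows some crazy numbers.

The problem is basically predicting salary from years of experience.

I'm not sure what I'm doing wrong in my TensorFlow code.

Thanks!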

scikit-learn solution

import pandas as pd
data = pd.read_csv('Salary_Data.csv') 

X = data.iloc[:, :-1].values
y = data.iloc[:, 1].values

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

X_single_data = [[4.6]]
y_single_pred = regressor.predict(X_single_data)

print(f'Train score: {regressor.score(X_train, y_train)}')
print(f'Test  score: {regressor.score(X_test, y_test)}')

Train score: 0.960775692121653

Test  score: 0.9248580247217076
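
For reference, `score` on a scikit-learn regressor returns the coefficient of determination R², so the two numbers above are R² values rather than losses. A minimal pure-Python sketch of that formula follows; the helper name `r_squared` and the toy values are hypothetical and are not taken from Salary_Data.csv.

```python
# Pure-Python sketch of R^2 = 1 - SS_res / SS_tot, the quantity that
# `score` reports for a regressor.
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: how far predictions are from the labels.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far labels are from their own mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy values, only for illustration.
y_true = [40000.0, 60000.0, 80000.0, 100000.0]
y_pred = [42000.0, 58000.0, 83000.0, 97000.0]
print(r_squared(y_true, y_pred))
```

Because R² is normalized by the spread of the labels, it cannot be compared directly with a raw squared-error loss such as the one an estimator reports during evaluation.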

TensorFlow solution

import tensorflow as tf

f_cols = [tf.feature_column.numeric_column(key='X', shape=[1])]
estimator = tf.estimator.LinearRegressor(feature_columns=f_cols)


train_input_fn = tf.estimator.inputs.numpy_input_fn(x={'X': X_train}, y=y_train,shuffle=False)

test_input_fn = tf.estimator.inputs.numpy_input_fn(x={'X': X_test}, y=y_test,shuffle=False)


train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn)
eval_spec = tf.estimator.EvalSpec(input_fn=test_input_fn)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

({'average_loss': 7675087400.0,
  'label/mean': 84588.11,
  'loss': 69075790000.0,
  'prediction/mean': 5.0796494,
  'global_step': 6},
 [])
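
A quick sanity check on these numbers can be sketched as below. It assumes that the reported average loss is a mean squared error over the evaluation examples and that the reported loss is the same squared error summed over one evaluation batch; neither assumption is confirmed by the output itself.

```python
import math

# Values copied from the evaluation output above.
average_loss = 7675087400.0
loss = 69075790000.0
label_mean = 84588.11
prediction_mean = 5.0796494

# Assumption: loss is the squared error summed over one evaluation batch,
# so loss / average_loss estimates the number of examples in that batch.
n_eval = loss / average_loss

# Assumption: average_loss is a mean squared error, so its square root
# is an RMSE in the same units as the salary.
rmse = math.sqrt(average_loss)

print(f"examples in eval batch: {n_eval:.2f}")
print(f"RMSE: {rmse:.0f}")
print(f"mean label / mean prediction: {label_mean / prediction_mean:.0f}")
```

Under those assumptions the ratio comes out at about 9, which matches a 30% test split of the 30 rows, and the RMSE is about the same size as the mean label. That would be consistent with predictions that are still close to zero after only 6 training steps.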

Data

YearsExperience,Salary
1.1,39343.00
1.3,46205.00
1.5,37731.00
2.0,43525.00
2.2,39891.00
2.9,56642.00
3.0,60150.00
3.2,54445.00
3.2,64445.00
3.7,57189.00
3.9,63218.00
4.0,55794.00
4.0,56957.00
4.1,57081.00
4.5,61111.00
4.9,67938.00
5.1,66029.00
5.3,83088.00
5.9,81363.00
6.0,93940.00
6.8,91738.00
7.1,98273.00
7.9,101302.00
8.2,113812.00
8.7,109431.00
9.0,105582.00
9.5,116969.00
9.6,112635.00
10.3,122391.00
10.5,121872.00

Answer 1

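I can't put an image in a comment, so I am placing it here. I suspected the relationship might be sigmoidal rather than linear, and found the following sigmoidal equation and fit statistics for salary in thousands: "y = a / (1.0 + exp(-(x-b)/c))" with fitted parameters a = 1.5535069418318591E+02, b = 5.4580059234664899E+00, and c = 3.7724942500630938E+00, giving R-squared = 0.96 and RMSE = 5.30 (thousand).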
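
As a small illustration of the equation above, the sketch below evaluates it with the quoted parameters at a few rows copied from the question's data. The function name `sigmoid_salary` and the choice of sample rows are only for illustration.

```python
import math

# Fitted parameters quoted above (salary in thousands).
a = 1.5535069418318591E+02
b = 5.4580059234664899E+00
c = 3.7724942500630938E+00


def sigmoid_salary(years):
    """Sigmoidal model y = a / (1 + exp(-(x - b) / c))."""
    return a / (1.0 + math.exp(-(years - b) / c))


# A few rows copied from the question's data, salary converted to thousands.
sample = [(1.1, 39.343), (4.5, 61.111), (7.9, 101.302), (10.5, 121.872)]
for years, salary in sample:
    print(f"{years:5.1f} years: data {salary:8.3f}, model {sigmoid_salary(years):8.3f}")
```

In this form a is the upper asymptote of the curve, b is the x value at which the curve reaches a/2, and c controls how quickly it rises.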

Answer 2

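Per your request for code in the comments: although I used the online curve and surface fitting web site zunzun.com at http://zunzun.com/Equation/2/Sigmoidal/Sigmoid%20B/ for the modeling work on this equation, here is a graphing source code example that uses scipy's differential_evolution genetic algorithm module to estimate the initial parameters. The scipy implementation of Differential Evolution uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, which requires bounds within which to search. In this example those bounds are taken from the data maximum and minimum values, and the fit statistics and parameter values are almost identical to those from the web site.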

import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings

xData = numpy.array([ 1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7, 3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0, 6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5])
yData = numpy.array([ 39.343, 46.205, 37.731, 43.525, 39.891, 56.642, 60.15, 54.445, 64.445, 57.189, 63.218, 55.794, 56.957, 57.081, 61.111, 67.938, 66.029, 83.088, 81.363, 93.94, 91.738, 98.273, 101.302, 113.812, 109.431, 105.582, 116.969, 112.635, 122.391, 121.872])


def func(x, a, b, c):
    return a / (1.0 + numpy.exp(-(x-b)/c))


# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
    val = func(xData, *parameterTuple)
    return numpy.sum((yData - val) ** 2.0)


def generate_Initial_Parameters():
    # min and max used for bounds
    maxX = max(xData)
    minX = min(xData)
    maxY = max(yData)
    minY = min(yData)

    parameterBounds = []
    parameterBounds.append([minY, maxY]) # search bounds for a
    parameterBounds.append([minX, maxX]) # search bounds for b
    parameterBounds.append([minX, maxX]) # search bounds for c

    # "seed" the numpy random number generator for repeatable results
    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x

# by default, differential_evolution finishes by polishing its best
# solution with a bounded local minimizer (L-BFGS-B) inside the parameter bounds
geneticParameters = generate_Initial_Parameters()

# now call curve_fit without passing bounds from the genetic algorithm,
# just in case the best fit parameters are outside those bounds
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Fitted parameters:', fittedParameters)
print()

modelPredictions = func(xData, *fittedParameters) 

absError = modelPredictions - yData

SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))

print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)

print()


##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(xData, yData,  'D')

    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = func(xModel, *fittedParameters)

    # now the model as a line plot
    axes.plot(xModel, yModel)

    axes.set_xlabel('Years of experience') # X axis data label
    axes.set_ylabel('Salary in thousands') # Y axis data label

    plt.show()
    plt.close('all') # clean up after using pyplot

graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

TensorFlow and scikit-learn: same solution but different outputs

2 answers: