Why is my TensorFlow linear regression implementation on the Ames Housing dataset so slow?

Asked: 2017-07-04 23:48:05

Tags: python machine-learning tensorflow linear-regression kaggle

I'm trying to run linear regression on the Ames Housing dataset available on Kaggle.

I first cleaned the data manually by dropping many features. I then train with the following implementation.

train_size = np.shape(x_train)[0]
valid_size = np.shape(x_valid)[0]
test_size = np.shape(x_test)[0]
num_features = np.shape(x_train)[1]

graph = tf.Graph()
with graph.as_default():

    # Input
    tf_train_dataset = tf.constant(x_train)
    tf_train_labels = tf.constant(y_train)
    tf_valid_dataset = tf.constant(x_valid)
    tf_test_dataset = tf.constant(x_test)

    # Variables
    weights = tf.Variable(tf.truncated_normal([num_features, 1]))
    biases = tf.Variable(tf.zeros([1]))

    # Loss Computation
    train_prediction = tf.matmul(tf_train_dataset, weights) + biases
    loss = tf.losses.mean_squared_error(tf_train_labels, train_prediction)

    # Optimizer
    # Gradient descent optimizer with learning rate = alpha
    alpha = tf.constant(0.000000003, dtype=tf.float64)
    optimizer = tf.train.GradientDescentOptimizer(alpha).minimize(loss)

    # Predictions
    valid_prediction = tf.matmul(tf_valid_dataset, weights) + biases
    test_prediction = tf.matmul(tf_test_dataset, weights) + biases

This is how I run the graph:

num_steps = 10001

def accuracy(prediction, labels):
    return ((prediction - labels) ** 2).mean(axis=None)

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(num_steps):
        _, l, predictions = session.run([optimizer, loss, train_prediction])
        if (step % 1000 == 0):
            print('Loss at step %d: %f' % (step, l))
            print('Validation accuracy: %.1f%%' % accuracy(valid_prediction.eval(), y_valid))
    t_pred = test_prediction.eval()
    print('Test accuracy: %.1f%%' % accuracy(t_pred, y_test))

Here's what I've tried:

  1. Raising the learning rate. However, if I increase it beyond the value I'm using now, the model fails to converge: the loss blows up to infinity.

  2. Increasing the number of iterations to 10,000,000. The longer it runs, the more slowly the loss converges (which is understandable), but I'm still far from a reasonable value. The loss is typically a 10-digit number.

  3. Am I doing something wrong with the graph? Or is linear regression simply a poor choice here, and I should try another algorithm? Any help and suggestions are greatly appreciated!
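(Aside: one common reason gradient descent needs such a tiny learning rate is that raw housing features such as lot area and year built sit on wildly different scales. A minimal sketch of z-scoring the inputs before building the graph; the array below is synthetic and only stands in for the cleaned Ames features:)

```python
import numpy as np

# Hypothetical stand-in for the cleaned Ames feature matrix.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10_000, size=(100, 5))  # columns on very different scales

# Z-score each column using training statistics only; the same
# mean/std would then be reused for the validation and test sets.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0) + 1e-8  # guard against zero-variance columns
x_train_scaled = (x_train - mean) / std

print(x_train_scaled.mean(axis=0))  # each column now centered near 0
print(x_train_scaled.std(axis=0))   # each column now has unit spread
```

With standardized inputs, a conventional learning rate (e.g. 0.01–1.0) usually converges without the loss diverging.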

1 answer:

Answer 0 (score: 1)

Working code

import csv
import tensorflow as tf
import numpy as np

with open('train.csv', 'rt') as f:
    reader = csv.reader(f)
    your_list = list(reader)

def toFloatNoFail(data):
    # Non-numeric fields (e.g. categorical strings) become 0.0
    try:
        return float(data)
    except (TypeError, ValueError):
        return 0.0

data = [ [ toFloatNoFail(x) for x in row ] for row in your_list[1:] ]
data = np.array( data ).astype( float )
x_train = data[:,:-1]
print(x_train.shape)
y_train = data[:,-1:]
print(y_train.shape)


num_features = np.shape(x_train)[1]

# Input
tf_train_dataset = tf.constant(x_train, dtype=tf.float32)
tf_train_labels = tf.constant(y_train, dtype=tf.float32)

# Variables
weights = tf.Variable(tf.truncated_normal( [num_features, 1] , dtype=tf.float32))
biases = tf.Variable(tf.constant(0.0, dtype=tf.float32 ))

train_prediction = tf.matmul(tf_train_dataset, weights) + biases

loss = tf.reduce_mean( tf.square( tf.log(tf_train_labels) - tf.log(train_prediction) ))

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

num_steps = 10001

def accuracy(prediction, labels):
    return ((prediction - labels) ** 2).mean(axis=None)


with tf.Session() as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(num_steps):
        _, l, predictions = session.run([optimizer, loss, train_prediction])
        if (step % 1000 == 0):
            print('Loss at step %d: %f' % (step, l))

Explanation of the key change

Your loss function was not scaled. The loss function above accounts for the fact that you really only care about the error in the price relative to the true price. So being off by $5,000 on a $5,000,000 house should not be penalized as heavily as being off by $5,000 on a $5,000 house.

The new loss function is:

loss = tf.reduce_mean( tf.square( tf.log(tf_train_labels) - tf.log(train_prediction) ))
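To see why this loss measures relative rather than absolute error, here's a small numeric check in plain NumPy (an illustration, not part of the original answer):

```python
import numpy as np

def sq_log_error(label, pred):
    # Squared difference of logs: the loss depends only on the
    # ratio pred/label, i.e. on the *relative* error.
    return (np.log(label) - np.log(pred)) ** 2

# A 10% overshoot on a cheap and an expensive house -> identical loss.
cheap = sq_log_error(5_000.0, 5_500.0)
pricey = sq_log_error(5_000_000.0, 5_500_000.0)
print(np.isclose(cheap, pricey))  # True

# The same $5,000 absolute error -> vastly different loss.
print(sq_log_error(5_000.0, 10_000.0) > sq_log_error(5_000_000.0, 5_005_000.0))  # True
```

One caveat with this loss: `tf.log` produces NaN if a prediction goes non-positive, so in practice the weights must stay in a regime where predictions remain positive.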