CSV >> TensorFlow >> Regression (via neural network) model

Date: 2016-10-17 19:02:49

Tags: python csv numpy tensorflow recurrent-neural-network

Endless googling has given me a much better education in Python and numpy, but I still can't solve my task. I want to read a CSV of integer/float values and predict a value with a neural network. I have found several examples that read the Iris dataset and do classification, but I don't understand how to adapt them for regression. Can someone help me connect the dots?

Here is one row of the input:

  

16804,0,1,0,1,1,0,1,0,1,0,1,0,0,1,1,0,0,1,0,1,0,1,0 ,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,1,0,0,1,0,1,0,1 ,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0 ,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1 ,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0 ,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1 ,0,0,0,1,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0 ,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1 ,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0 ,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0 ,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0 ,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0 ,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0.490265,0.620805,0.54977,0.869299,0.422268,0.351223,0.33572,0.68308,0.40455,0.47779,0.307628,0.301921 ,0.318646,0.365993,6135.81

That should be 925 values. The last column is the output. The first is a RowID. Most of the columns are binary because I have already done one-hot encoding. The test file has no output/last column. The full training file has about 10M rows. A general MxN solution is fine.
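A minimal sketch of loading a file shaped like the row above and splitting it into RowID / features / target. The tiny inline CSV here is made up (3 rows, RowID + 4 feature columns + target), standing in for the real 925-column file:

```python
import io
import numpy as np

# Made-up stand-in for the real file: RowID, 4 features, target.
csv_text = u"1,0,1,0.5,0.25,10.0\n2,1,0,0.75,0.1,20.0\n3,0,0,0.9,0.33,30.0"

data = np.loadtxt(io.StringIO(csv_text), delimiter=",")
row_ids = data[:, 0]      # first column: RowID, not a predictive feature
features = data[:, 1:-1]  # middle columns: mostly one-hot inputs
targets = data[:, -1]     # last column: the value to predict
print(features.shape, targets.shape)
```

For the real file, `np.loadtxt("train.csv", delimiter=",")` (filename hypothetical) plus the same slicing gives M x 923 features and an (M,) target vector.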

Edit: since Iris is a classification problem, let's use this sample data instead, but note that the above is my real goal. I removed the ID column. Let's predict the last column given the other 6 columns. It has 45 rows. (src: http://www.stat.ufl.edu/~winner/data/civwar2.dat)

  

100,1861,5,2,3,5,38
112,1863,11,7,4,59.82,15.18
113,1862,34,32,1,79.65,2.65
90,1862,5,2,3,68.89,5.56
93,1862,14,10,4,61.29,17.2
179,1862,22,19,3,62.01,8.89
99,1861,22,16,6,67.68,27.27
111,1862,16,11,4,78.38,8.11
107,1863,17,11,5,60.75,5.61
156,1862,32,30,2,60.9,12.82
152,1862,23,21,2,73.55,6.41
72,1863,7,3,3,54.17,20.83
134,1862,22,21,1,67.91,9.7
180,1862,23,16,4,69.44,3.89
143,1863,23,19,4,81.12,8.39
110,1862,16,12,2,31.82,9.09
157,1862,15,10,5,52.23,24.84
101,1863,4,1,3,58.42,18.81
115,1862,14,11,3,86.96,5.22
103,1862,7,6,1,70.87,0
90,1862,11,11,0,70,4.44
105,1862,20,17,3,80,4.76
104,1862,11,9,1,29.81,9.62
102,1862,17,10,7,49.02,6.86
112,1862,19,14,5,26.79,14.29
87,1862,6,3,3,8.05,72.41
92,1862,4,3,0,11.96,86.96
108,1862,12,7,3,16.67,25
86,1864,0,0,0,2.33,11.63
82,1864,4,3,1,81.71,8.54
76,1864,1,0,1,48.68,6.58
79,1864,0,0,0,15.19,21.52
85,1864,1,1,0,89.41,3.53
85,1864,1,1,0,56.47,0
85,1864,0,0,0,31.76,15.29
87,1864,6,5,0,81.61,3.45
85,1864,5,5,0,72.94,0
83,1864,0,0,0,46.99,2.38
101,1864,5,5,0,1.98,95.05
99,1864,6,6,0,42.42,9.09
10,1864,0,0,0,50,9
98,1864,6,6,0,79.59,3.06
10,1864,0,0,0,71,9
78,1864,5,5,0,70.51,1.28
89,1864,4,4,0,59.55,13.48

Let me add that this is a common task, but I haven't found it answered on any forum, which is why I'm asking. I could post my broken code, but I don't want to waste your time on code whose functionality is wrong. Sorry to ask this way. I just don't understand the API, and the documentation doesn't tell me the data types.

Here is my latest code for reading the CSV into two ndarrays:

#!/usr/bin/env python
import numpy as np

# Read the CSV into a feature ndarray and a label ndarray
def buildDataFromCSV():
    data = np.loadtxt(open("t100.csv.out", "rb"), delimiter=",", skiprows=0)
    labels = data[:, 924].copy()            # last column is the target
    print("labels:", type(labels), labels.shape, labels.ndim)
    data = np.delete(data, [924], axis=1)   # drop the target from the features
    print("data:", type(data), data.shape, data.ndim)
    return data, labels

Here is the basic code I want to use. The example it comes from is also incomplete, and the API in the links below is vague. If I could at least figure out the data types to feed into DNNRegressor, and the other types mentioned in the docs, I could probably write some custom code.

estimator = DNNRegressor(
    feature_columns=[education_emb, occupation_emb],
    hidden_units=[1024, 512, 256])

# Or estimator using the ProximalAdagradOptimizer optimizer with
# regularization.
estimator = DNNRegressor(
    feature_columns=[education_emb, occupation_emb],
    hidden_units=[1024, 512, 256],
    optimizer=tf.train.ProximalAdagradOptimizer(
      learning_rate=0.1,
      l1_regularization_strength=0.001
    ))

# Input builders
def input_fn_train():  # returns x, Y
  pass
estimator.fit(input_fn=input_fn_train)

def input_fn_eval():  # returns x, Y
  pass
estimator.evaluate(input_fn=input_fn_eval)
estimator.predict(x=x)
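The contract an input_fn must fulfill is just "return a batch of features and the matching targets". As a sketch, plain numpy arrays stand in below for the tensors a real tf.contrib.learn input_fn would return (the toy data and target are made up):

```python
import numpy as np

def input_fn_train():
    # Toy batch: 8 rows, 6 feature columns; target is the row sum.
    rng = np.random.RandomState(0)
    x = rng.rand(8, 6)
    y = x.sum(axis=1, keepdims=True)
    return x, y

x, y = input_fn_train()
print(x.shape, y.shape)
```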

And then the biggest question is how to make all of these work together.

Here are a few of the pages I have been following.

2 Answers:

Answer 0 (score: 4):

I found lower-level TensorFlow hard to figure out in the past too; the documentation isn't amazing. If you focus on getting the hang of sklearn, you should find skflow relatively easy to use. skflow sits at a much higher level than tensorflow, and its API is nearly identical to sklearn's.

Now for the answer:

As a regression example, we'll just perform regression on the iris dataset. It's a silly idea, but it demonstrates how to use DNNRegressor.

Skflow API

When using a new API for the first time, try to use as few parameters as possible; you just want to get something working. So, I suggest you set up a DNNRegressor like this:

estimator = skflow.DNNRegressor(hidden_units=[16, 16])

I keep the number of hidden units small because I don't have much computing power right now.

Then you give it your training data train_X and training labels train_y, like so:

estimator.fit(train_X, train_y)

This is the standard procedure for all sklearn classifiers and regressors; skflow simply extends tensorflow to behave like sklearn. I also set the parameter steps=10 so that training completes faster, running only 10 iterations.

Now, if you want it to predict on some new data test_X, do:

pred = estimator.predict(test_X)

Again, standard procedure for all sklearn code. That's it: skflow is so simplified that you only need those three lines!

What format should train_X and train_y be in?

If you're not too familiar with machine learning: your training data is generally an ndarray (matrix) of size M x d, where you have M training samples and d features. Your labels are M x 1 (an ndarray of shape (M,)).

So what you have looks something like this:

Features:  Sepal Width  Sepal Length  ...         Labels
       [   5.1          2.5          ]         [0 (setosa)     ]
X =    [   2.3          2.4          ]    y =  [1 (virginica)  ]
       [   ...          ...          ]         [ ...           ]
       [   1.3          4.5          ]         [2 (Versicolour)]

(Note that I just made up all those numbers.)

The test data is just an N x d matrix with N test examples, each of which needs all d features. The predict function takes in the test data and returns predicted test labels of shape N x 1 (an ndarray of shape (N,)).

You didn't supply a .csv file, so I'll leave it to you to parse your data into that format. Conveniently, though, we can use sklearn.datasets.load_iris() to get the X and y we want. It's just

iris = datasets.load_iris()
X = iris.data
y = iris.target

Using the regressor as a classifier

The output of DNNRegressor will be a bunch of real numbers (like 1.6789). But the iris dataset has labels 0, 1 and 2, the integer IDs for Setosa, Versicolour and Virginica. To do classification with this regressor, we simply round each prediction to the nearest label (0, 1, 2). For example, a prediction of 1.6789 rounds to 2.
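The rounding step above can be sketched in a couple of lines of numpy (the `preds` values are made-up example predictions):

```python
import numpy as np

# Round continuous regressor outputs to the nearest class id in {0, 1, 2}.
preds = np.array([0.1, 1.6789, 2.3, 0.9])
labels = np.clip(np.rint(preds), 0, 2).astype(int)
print(labels)
```

The `np.clip` keeps out-of-range outputs (like 2.3 or -0.4) from producing labels outside the valid set.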

Working example

I find I learn the most from a working example, so here is a very simplified one:

(The working example was posted as an image and is not reproduced here.)
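As a sketch of what that workflow looks like end to end, here is the same fit / predict / round pipeline in plain sklearn. MLPRegressor stands in for skflow.DNNRegressor here, and the hyperparameters are illustrative only, not the answerer's:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

iris = datasets.load_iris()
X, y = iris.data, iris.target
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

# Small network, as in the answer above (two hidden layers of 16 units)
est = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
est.fit(train_X, train_y)
pred = est.predict(test_X)

# Round the continuous outputs to the nearest class label to classify
rounded = np.clip(np.rint(pred), 0, 2).astype(int)
accuracy = float(np.mean(rounded == test_y))
print("accuracy:", accuracy)
```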

Feel free to post any further questions as comments.

Answer 1 (score: 0):

I ended up with a few options that work. I don't know why it was so difficult to get up and running. First, here is a version based on @user2570465's code.

import tensorflow as tf
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
import tensorflow.contrib.learn as skflow

def buildDataFromIris():
    iris = datasets.load_iris()
    return iris.data, iris.target

X, y = buildDataFromIris()
feature_cols = tf.contrib.learn.infer_real_valued_columns_from_input(X)
estimator = skflow.DNNRegressor( feature_columns=feature_cols, hidden_units=[10, 10])
train_X, test_X, train_y, test_y = train_test_split(X, y)
estimator.fit(train_X, train_y, steps=10)

test_preds = estimator.predict(test_X)

def CalculateAccuracy(X, y):
    continuous_predictions = estimator.predict(X)
    closest_class = []
    for pred in continuous_predictions:
        differences = np.array([abs(pred - 0), abs(pred - 1), abs(pred - 2)])
        closest_class.append(np.argmin(differences))

    num_correct = np.sum(closest_class == y)
    accuracy = float(num_correct)/len(y)
    return accuracy

train_accuracy = CalculateAccuracy(train_X, train_y)
test_accuracy = CalculateAccuracy(test_X, test_y)

print("Train accuracy: %f" % train_accuracy)
print("Test accuracy: %f" % test_accuracy)

The other solution builds the model from smaller components. Here is a snippet that computes Sig(X*W1 + b1)*W2 + b2 = Y, with optimizer = Adam, loss = L2, eval = L2 and MSE.

# Hyperparameters and data sizes (set these for your own data;
# all_x and all_y are the full feature and target arrays)
n_input = 6           # number of feature columns
n_output = 1
layer1_neurons = 16
numEpochs = 1000
batchSize = 32
train_size = int(0.8 * len(all_x))

x_train = all_x[:train_size]
y_train = all_y[:train_size]
x_val = all_x[train_size:]
y_val = all_y[train_size:]
print("x_train: {}".format(x_train.shape))

# Build the model: Sig(X*W1 + b1)*W2 + b2 = Y
X = tf.placeholder(tf.float32, [None, n_input], name='X')
Y = tf.placeholder(tf.float32, [None, n_output], name='Y')

w_h = tf.Variable(tf.random_uniform([n_input, layer1_neurons], minval=-1, maxval=1, dtype=tf.float32))
b_h = tf.Variable(tf.zeros([1, layer1_neurons], dtype=tf.float32))
h = tf.nn.sigmoid(tf.matmul(X, w_h) + b_h)

w_o = tf.Variable(tf.random_uniform([layer1_neurons, 1], minval=-1, maxval=1, dtype=tf.float32))
b_o = tf.Variable(tf.zeros([1, 1], dtype=tf.float32))
model = tf.matmul(h, w_o) + b_o

# L2 loss: sum((model - Y) ** 2) / 2
loss = tf.nn.l2_loss(model - Y)
train_op = tf.train.AdamOptimizer().minimize(loss)

# Launch the session
sess = tf.Session()
sess.run(tf.initialize_all_variables())

errors = []
for i in range(numEpochs):
    for start, end in zip(range(0, len(x_train), batchSize), range(batchSize, len(x_train), batchSize)):
        sess.run(train_op, feed_dict={X: x_train[start:end], Y: y_train[start:end]})
    cost = sess.run(loss, feed_dict={X: x_val, Y: y_val})
    errors.append(cost)
    if i % 100 == 0: print("epoch %d, cost = %g" % (i, cost))
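A quick numpy sanity check of the loss used above: tf.nn.l2_loss(t) computes sum(t ** 2) / 2, which matches the formula sum((model - Y) ** 2) / 2 mentioned in the snippet (the example vectors are made up):

```python
import numpy as np

# Reproduce the L2 loss by hand: sum of squared residuals, halved.
model_out = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 1.0, 2.0])
l2 = np.sum((model_out - y) ** 2) / 2
print(l2)
```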