Multi-target and multi-class prediction

Date: 2017-08-04 19:52:02

Tags: tensorflow

I am fairly new to both machine learning and TensorFlow. I want to train on my data so that I can make predictions with 2 targets and multiple classes. Is this possible? I was able to implement the algorithm for 1 target, but I don't know how to do it for a second target as well.

Example dataset:

DayOfYear  Temperature  Flow  Visibility
316 8   1   4
285 -1  1   4
326 8   2   5
323 -1  0   3
10  7   3   6
62  8   0   3
56  8   1   4
347 7   2   5
363 7   0   3
77  7   3   6
1   7   1   4
308 -1  2   5
364 7   3   6

If I train on (DayOfYear, Temperature, Flow) I can predict Visibility quite well. But I also need to somehow predict Flow. I'm pretty sure Flow influences Visibility, so I'm not sure how to go about this.

Here is my implementation:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import urllib

import numpy as np
import tensorflow as tf

# Data sets
TRAINING = "/ml_baetterich_learn.csv"
TEST = "/ml_baetterich_test.csv"
VALIDATION = "/ml_baetterich_validation.csv"

def main():

  # Load datasets.
  training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
      filename=TRAINING,
      target_dtype=np.int,
      features_dtype=np.int,
      target_column=-1)
  test_set = tf.contrib.learn.datasets.base.load_csv_without_header(
      filename=TEST,
      target_dtype=np.int,
      features_dtype=np.int,
      target_column=-1)
  validation_set = tf.contrib.learn.datasets.base.load_csv_without_header(
      filename=VALIDATION,
      target_dtype=np.int,
      features_dtype=np.int,
      target_column=-1)

  # Specify that all features have real-value data
  feature_columns = [tf.contrib.layers.real_valued_column("", dimension=3)]

  # Build 3 layer DNN with 10, 20, 10 units respectively.
  classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                              hidden_units=[10, 20, 10],
                                              n_classes=9,
                                              model_dir="/tmp/iris_model")
  # Define the training inputs
  def get_train_inputs():
    x = tf.constant(training_set.data)
    y = tf.constant(training_set.target)

    return x, y

  # Fit model.
  classifier.fit(input_fn=get_train_inputs, steps=4000)

  # Define the test inputs
  def get_test_inputs():
    x = tf.constant(test_set.data)
    y = tf.constant(test_set.target)

    return x, y

  # Define the validation inputs
  def get_validation_inputs():
    x = tf.constant(validation_set.data)
    y = tf.constant(validation_set.target)

    return x, y

  # Evaluate accuracy.
  accuracy_test_score = classifier.evaluate(input_fn=get_test_inputs,
                                       steps=1)["accuracy"]

  accuracy_validation_score = classifier.evaluate(input_fn=get_validation_inputs,
                                       steps=1)["accuracy"]

  print ("\nValidation Accuracy: {0:0.2f}\nTest Accuracy: {1:0.2f}\n".format(accuracy_validation_score,accuracy_test_score))

  # Classify two new samples.
  def new_samples():
    return np.array(
      [[327,8,3],
       [47,8,0]], dtype=np.float32)

  predictions = list(classifier.predict_classes(input_fn=new_samples))

  print(
      "New Samples, Class Predictions:    {}\n"
      .format(predictions))

if __name__ == "__main__":
    main()

1 Answer:

Answer 0: (score: 6)

Option 1: multi-headed model

You can use a multi-headed DNNEstimator model. This treats Flow and Visibility as two separate softmax classification targets, each with its own set of classes. I had to modify the load_csv_without_header helper function to support multiple targets (it could probably be cleaner, but that's not the point here; feel free to ignore its details).

import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile
import csv
import collections

num_flow_classes = 4
num_visib_classes = 7

Dataset = collections.namedtuple('Dataset', ['data', 'target'])

def load_csv_without_header(fn, target_dtype, features_dtype, target_columns):
    with gfile.Open(fn) as csv_file:
        data_file = csv.reader(csv_file)
        data = []
        targets = {
            target_cols: []
            for target_cols in target_columns.keys()
        }
        for row in data_file:
            cols = sorted(target_columns.items(), key=lambda tup: tup[1], reverse=True)
            for target_col_name, target_col_i in cols:
                targets[target_col_name].append(row.pop(target_col_i))
            data.append(np.asarray(row, dtype=features_dtype))

        targets = {
            target_col_name: np.array(val, dtype=target_dtype)
            for target_col_name, val in targets.items()
        }
        data = np.array(data)
        return Dataset(data=data, target=targets)

feature_columns = [
    tf.contrib.layers.real_valued_column("", dimension=1),
    tf.contrib.layers.real_valued_column("", dimension=2),
]
# Two softmax classification heads, one per target, trained jointly on top of the same DNN.
head = tf.contrib.learn.multi_head([
    tf.contrib.learn.multi_class_head(
        num_flow_classes, label_name="Flow", head_name="Flow"),
    tf.contrib.learn.multi_class_head(
        num_visib_classes, label_name="Visibility", head_name="Visibility"),
])
classifier = tf.contrib.learn.DNNEstimator(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    model_dir="iris_model",
    head=head,
)

# Build an input_fn that returns the feature matrix and a dict of label tensors keyed by head name.
def get_input_fn(filename):
    def input_fn():
        dataset = load_csv_without_header(
            fn=filename,
            target_dtype=np.int,
            features_dtype=np.int,
            target_columns={"Flow": 2, "Visibility": 3}
        )
        x = tf.constant(dataset.data)
        y = {k: tf.constant(v) for k, v in dataset.target.items()}
        return x, y
    return input_fn

classifier.fit(input_fn=get_input_fn("tmp_train.csv"), steps=4000)
res = classifier.evaluate(input_fn=get_input_fn("tmp_test.csv"), steps=1)

print("Validation:", res)

Option 2: multi-label head

If you use comma-separated CSV data and keep in the last column all of the classes a row may have, separated by some token such as a space, you can use the code below.
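
For instance (my own illustration, derived from the sample dataset above), the first three rows of the training CSV would then look like this, with Flow and Visibility combined in the last column:

316,8,1 4
285,-1,1 4
326,8,2 5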

import numpy as np
import tensorflow as tf

all_classes = ["0", "1", "2", "3", "4", "5", "6"]

# Turns a batch of space-delimited class strings into a dense k-hot matrix over all_classes.
def k_hot(classes_col, all_classes, delimiter=' '):
    table = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(all_classes)
    )
    classes = tf.string_split(classes_col, delimiter)
    ids = table.lookup(classes)
    num_items = tf.cast(tf.shape(ids)[0], tf.int64)
    num_entries = tf.shape(ids.indices)[0]

    y = tf.SparseTensor(
        indices=tf.stack([ids.indices[:, 0], ids.values], axis=1),
        values=tf.ones(shape=(num_entries,), dtype=tf.int32),
        dense_shape=(num_items, len(all_classes)),
    )
    y = tf.sparse_tensor_to_dense(y, validate_indices=False)
    return y

def feature_engineering_fn(features, labels):
    labels = k_hot(labels, all_classes)
    return features, labels

feature_columns = [
    tf.contrib.layers.real_valued_column("", dimension=1), # DayOfYear
    tf.contrib.layers.real_valued_column("", dimension=2), # Temperature
]
classifier = tf.contrib.learn.DNNEstimator(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    model_dir="iris_model",
    head=tf.contrib.learn.multi_label_head(n_classes=len(all_classes)),
    feature_engineering_fn=feature_engineering_fn,
)

def get_input_fn(filename):
    def input_fn():
        dataset = tf.contrib.learn.datasets.base.load_csv_without_header(
            filename=filename,
            target_dtype="S100", # strings of length up to 100 characters
            features_dtype=np.int,
            target_column=-1
        )
        x = tf.constant(dataset.data)
        y = tf.constant(dataset.target)
        return x, y
    return input_fn

classifier.fit(input_fn=get_input_fn("tmp_train.csv"), steps=4000)
res = classifier.evaluate(input_fn=get_input_fn("tmp_test.csv"), steps=1)

print("Validation:", res)

We are using a DNNEstimator with a multi_label_head, which uses sigmoid cross-entropy instead of softmax cross-entropy as the loss function. This means that each of the output units/logits is passed through the sigmoid function, which gives the likelihood of the data point belonging to that class, i.e. the classes are computed independently and are not mutually exclusive, as they would be with softmax cross-entropy. This means that you can have anywhere between 0 and len(all_classes) classes set for each row in the training set and in the final predictions.
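
To make the difference concrete, here is a small numpy sketch (my own illustration, not part of the original answer) of how sigmoid scores stay independent per class while softmax scores compete and sum to 1:

import numpy as np

logits = np.array([2.0, -1.0, 0.5])  # hypothetical logits for 3 classes

# Sigmoid: each class gets its own independent score in (0, 1);
# several classes can be "on" at once.
sigmoid = 1.0 / (1.0 + np.exp(-logits))
print(sigmoid)  # approximately [0.88, 0.27, 0.62], does not sum to 1

# Softmax: the scores are coupled and sum to 1, so the classes are mutually exclusive.
softmax = np.exp(logits) / np.exp(logits).sum()
print(softmax)  # approximately [0.79, 0.04, 0.18], sums to 1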

Also note that the classes are represented as strings (and k_hot converts them into token indices), so you could use arbitrary class identifiers, for example category UUIDs in an e-commerce setting. If the categories in the 3rd and 4th columns are different (Flow ID 1 != Visibility ID 1), you can prepend the column name to each class ID, e.g.

316,8,flow1 visibility4
285,-1,flow1 visibility4
326,8,flow2 visibility5
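
In that case (my assumption, extending the answer's logic) all_classes would have to list the prefixed identifiers rather than the bare digits, for example:

all_classes = ["flow0", "flow1", "flow2", "flow3",
               "visibility0", "visibility1", "visibility2", "visibility3",
               "visibility4", "visibility5", "visibility6"]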

For an explanation of how k_hot works, see my other SO answer. I decided to keep k_hot as a separate function (rather than defining it directly in feature_engineering_fn) because it is a distinct piece of functionality, and TensorFlow may soon have a similar utility function.
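
To see what this encoding looks like, here is a rough numpy equivalent (my own sketch, not the answer's code) of the k-hot matrix that k_hot is meant to produce for two label strings:

import numpy as np

all_classes = ["0", "1", "2", "3", "4", "5", "6"]

def k_hot_numpy(label_strings, all_classes, delimiter=' '):
    # Map each class string to its column index, then set those columns to 1 per row.
    index = {c: i for i, c in enumerate(all_classes)}
    out = np.zeros((len(label_strings), len(all_classes)), dtype=np.int32)
    for row, labels in enumerate(label_strings):
        for token in labels.split(delimiter):
            out[row, index[token]] = 1
    return out

print(k_hot_numpy(["1 4", "2 5"], all_classes))
# [[0 1 0 0 1 0 0]
#  [0 0 1 0 0 1 0]]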

Note that if you now use only the first two columns to predict the last two, your accuracy will certainly go down, since the last two columns are highly correlated and using one of them gives you a lot of information about the other. Actually, your code was using the 3rd column (Flow) as an input feature, which is a kind of cheat anyway if the goal is to predict the 3rd and 4th columns.