Question

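I have a relatively small dataset that I load into memory as a pandas DataFrame. I want to feed this data to a TensorFlow model in batches while keeping support for sparse (categorical) columns. I also want to avoid serializing the data to disk in another format. This doesn't seem too complicated, but I couldn't find a good example in the documentation, and I'm having a hard time designing a suitable input_fn.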

A toy example dataset would be:

df = pd.DataFrame(np.random.randint(1, 4, [7, 3]), columns=['c0', 'c1', 'c2'])
df['c1'] = df['c1'].astype(str) + 'g'
df['c2'] = (df['c2'] > 2.5).astype(int)

>>> df
   c0  c1  c2
0   3  3g   1
1   1  1g   0
2   1  2g   0
3   2  2g   1
4   2  3g   0
5   1  3g   0
6   3  1g   0

Here c0 is a dense numeric column, c1 is a categorical column, and c2 is a binary label column.

My solution is below; any prettier and/or more efficient solution would be great.

Answer 1

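This is my (OP's) solution. The conversion and serialization steps are very slow (about 3 seconds per 1000 samples). Anything more efficient would be much appreciated.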

import tensorflow as tf

######################################
# Define Feature conversion functions
######################################
def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[int(value)]))

def float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[float(value)]))

def bytes_feature(value):
    # BytesList needs bytes, so encode the string form (required on Python 3)
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[str(value).encode('utf-8')]))

####################################################
# Define tensorflow data feed from pandas DataFrame
####################################################
def input_fn(df, label_col_name, int_col_names, float_col_names, cat_col_names, num_epochs, batch_size, shuffle=False):
    # Define new column groups
    feature_col_names = int_col_names + float_col_names + cat_col_names
    all_col_names = [label_col_name] + feature_col_names

    # Create conversion and parser dicts
    converters = {}
    parse_dict = {}
    for col in all_col_names:
        if col in cat_col_names:
            converters[col] = bytes_feature
            parse_dict[col] = tf.VarLenFeature(tf.string)
        elif col in float_col_names:
            converters[col] = float_feature
            parse_dict[col] = tf.FixedLenFeature([], tf.float32)
        elif col in int_col_names + [label_col_name]:
            converters[col] = int64_feature
            parse_dict[col] = tf.FixedLenFeature([], tf.int64)

    # Convert DataFrame rows to feature Examples, serialize examples to string
    serialized_examples = []
    for record in df[all_col_names].to_dict('records'):
        feat_record = {k: converters[k](v) for k, v in record.items()}
        example = tf.train.Example(features=tf.train.Features(feature=feat_record))
        serialized_examples.append(example.SerializeToString())

    # Create input queue
    example_queue = tf.train.slice_input_producer([serialized_examples], num_epochs=num_epochs, shuffle=shuffle)

    # Create batch
    example_batch = tf.train.batch(example_queue, batch_size=batch_size, capacity=30, allow_smaller_final_batch=True)

    # Parse batch
    parsed_example_batch = tf.parse_example(example_batch, parse_dict)

    # Split into features and label
    feature_batch = {k: parsed_example_batch[k] for k in feature_col_names}
    label_batch = parsed_example_batch[label_col_name]

    return feature_batch, label_batch

Example usage:

import functools
import numpy as np
import pandas as pd

# Create toy dataset
df = pd.DataFrame(np.random.randint(1, 4, [7, 3]), columns=['c0', 'c1', 'c2'])
df['c1'] = df['c1'].astype(str) + 'g'
df['c2'] = (df['c2'] > 2.5).astype(int)

# Specify feature names
cat_feats = ['c1']
float_feats = []
int_feats = ['c0']
label_feat = 'c2'

# Create parameterless input function
epochs = 3
batch_size = 2
input_fn_train = functools.partial(input_fn, df, label_feat, int_feats, float_feats, cat_feats, epochs, batch_size)

# Define features
continuous_features = [tf.contrib.layers.real_valued_column(feat) for feat in float_feats+int_feats]
categorical_features = [tf.contrib.layers.sparse_column_with_hash_bucket(feat, hash_bucket_size=1000) for feat in cat_feats]
features = continuous_features + categorical_features

# Create and fit model
model = tf.contrib.learn.LinearClassifier(feature_columns=features)
model.fit(input_fn=input_fn_train, steps=1000)
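
The sparse_column_with_hash_bucket call above maps each category string to one of hash_bucket_size integer ids. The following is only a rough plain-Python illustration of that idea. It assumes hashlib's MD5 as a stand-in for TensorFlow's own hash function, so the ids will not match what TensorFlow produces, and hash_bucket is a hypothetical helper name.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Map a category string to an integer id in [0, hash_bucket_size).

    Assumption: MD5 is used purely for illustration; TensorFlow uses a
    different hash internally, so the concrete ids will differ.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


# Category strings shaped like the c1 column of the toy dataset
categories = ["1g", "2g", "3g", "1g"]
bucket_ids = [hash_bucket(c, hash_bucket_size=1000) for c in categories]
print(bucket_ids)
```

Equal strings always land in the same bucket, while distinct strings can collide, which is why the bucket count is usually chosen well above the number of distinct categories.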

Answer 2

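First of all, why serialize the rows into tf.Examples and then deserialize them with parse_example? Serializing them, then batching them, then deserializing them is work that isn't needed. The standard way to define input functions in TensorFlow is tf.data. For tf.data + tf.Estimators, this documentation may help. In this case specifically, the following code should work: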

def input_fn(df, label_feat, num_epochs, batch_size, shuffle=False):
  # Each element of dataset is one row of the dataframe
  dataset = tf.data.Dataset.from_tensor_slices(dict(df))

  def map_fn(element, label_feat):
    # element is a {'c0': int, 'c1': str, 'c2': int} dictionary
    label = element.pop(label_feat)
    return (element, label)

  if shuffle:
    # Assumption: the buffer covers the whole DataFrame, which already fits in memory
    dataset = dataset.shuffle(buffer_size=len(df))

  # Batch the elements of the dataset
  dataset = dataset.batch(batch_size)
  # Repeat the dataset for num_epochs
  dataset = dataset.repeat(num_epochs)

  # Split it into features, label tuple
  dataset = dataset.map(lambda elem: map_fn(elem, label_feat))

  # One shot iterator iterates through the (repeated) dataset once, 
  # yielding feature_batch, label_batch
  iterator = dataset.make_one_shot_iterator()
  feature_batch, label_batch = iterator.get_next()
  return feature_batch, label_batch
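
To make the ordering of batch and repeat in the function above concrete, here is a plain-Python sketch of what batching before repeating yields. It does not use TensorFlow; batch_then_repeat is a hypothetical helper that only mimics the ordering, assuming the final short batch of each epoch is kept rather than dropped.

```python
def batch_then_repeat(rows, batch_size, num_epochs):
    """Yield batches epoch by epoch.

    Each epoch is batched on its own, so a batch never mixes rows from
    two epochs and the last batch of an epoch may be smaller.
    """
    for _ in range(num_epochs):
        for start in range(0, len(rows), batch_size):
            yield rows[start:start + batch_size]


# 7 rows like the toy DataFrame, batch_size=2, 3 epochs
rows = list(range(7))
batches = list(batch_then_repeat(rows, batch_size=2, num_epochs=3))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # [2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1]
```

With 7 rows and a batch size of 2, every epoch ends in a batch of one row. Repeating before batching would instead let batches straddle epoch boundaries.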

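Also, based on your code, there seems to be some confusion about SparseTensors and sparse_column. When you use tf.VarLenFeature, the feature is parsed as a SparseTensor, which is only needed when the feature being parsed has a variable shape. In this case, your c1 features are all scalar string tensors, so FixedLenFeature should work here; there is no need to represent the feature as a sparse tensor, even if it will eventually be represented as a sparse_column. This documentation tells you more about sparse columns.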
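
The difference can be illustrated without TensorFlow. The sketch below builds the (indices, values, dense_shape) triple that a sparse representation carries; to_sparse is a hypothetical helper written for illustration, not a TensorFlow function. For one scalar string per row, the sparse form holds nothing a plain list of values does not already hold, whereas for rows of varying length it is what records which positions are present.

```python
def to_sparse(batch):
    """Return (indices, values, dense_shape) for a batch of per-row value lists."""
    indices, values = [], []
    for row, row_values in enumerate(batch):
        for col, value in enumerate(row_values):
            indices.append([row, col])
            values.append(value)
    max_len = max((len(r) for r in batch), default=0)
    return indices, values, [len(batch), max_len]


# One scalar string per row, like the c1 column
scalar_batch = [["3g"], ["1g"], ["2g"]]
indices, values, dense_shape = to_sparse(scalar_batch)
dense = [row[0] for row in scalar_batch]

# Rows of varying length, where a sparse form is actually needed
ragged_batch = [["a", "b"], [], ["c"]]
ragged_indices, ragged_values, ragged_shape = to_sparse(ragged_batch)

print(indices, values, dense_shape)
print(ragged_indices, ragged_values, ragged_shape)
```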

TensorFlow - Input from a DataFrame using batching and sparse/categorical data

2 answers: