Question

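I am getting started with a Tensorflow project, and am in the middle of defining and creating my feature columns. However, I have hundreds and hundreds of features - it is a pretty extensive dataset. Even after preprocessing and scrubbing, I have a lot of columns.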

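The traditional way of creating a feature_column is described in the Tensorflow tutorial and also in this StackOverflow post. You essentially declare and initialize a Tensorflow object for each feature column: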

gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])

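This works well if your dataset has only a few columns, but in my case I certainly don't want hundreds of lines of code initializing different feature_column objects.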

What is the best way to resolve this issue? I notice that in the tutorial, all the columns are collected into a list:

base_columns = [
    gender, native_country, education, occupation, workclass, relationship,
    age_buckets,
]

which is ultimately passed into your estimator:

m = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=base_columns)

Would the ideal way of handling feature_column creation for hundreds of columns be to append them directly to a list? Something like this?

my_columns = []

for col in df.columns:
    if is_string_dtype(df[col]): #is_string_dtype is pandas function
        my_column.append(tf.feature_column.categorical_column_with_hash_bucket(col, 
            hash_bucket_size= len(df[col].unique())))

    elif is_numeric_dtype(df[col]): #is_numeric_dtype is pandas function
        my_column.append(tf.feature_column.numeric_column(col))

Is this the best way of creating these feature columns? Or am I missing some functionality in Tensorflow that would let me work around this step?

Answer 1

What you have makes sense to me. :) Copying from your own code:

import tensorflow as tf
from pandas.api.types import is_string_dtype, is_numeric_dtype

my_columns = []

for col in df.columns:
  if is_string_dtype(df[col]):  # is_string_dtype is a pandas function
    my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
        col, hash_bucket_size=len(df[col].unique())))

  elif is_numeric_dtype(df[col]):  # is_numeric_dtype is a pandas function
    my_columns.append(tf.feature_column.numeric_column(col))
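
One thing that may be worth checking with this loop: setting hash_bucket_size to exactly the number of unique values does not guarantee one bucket per value, because hashing can send two different strings to the same bucket. The sketch below only illustrates that effect in plain Python. It uses SHA-256 as a stand-in hash, which is an assumption for illustration and not the fingerprint function TensorFlow actually uses, and `bucket_of` and the generated values are hypothetical names.

```python
import hashlib


def bucket_of(value, num_buckets):
    """Map a string to a bucket index.

    Stand-in hash (SHA-256) for illustration only; this is NOT the
    hash function TensorFlow uses internally.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


# 1000 distinct categorical values (made up for the example).
values = ["category_%d" % i for i in range(1000)]

# Same sizing rule as in the loop above: one bucket per unique value.
num_buckets = len(set(values))

# Collect the set of buckets that at least one value landed in.
used_buckets = {bucket_of(v, num_buckets) for v in values}
collided = len(values) - len(used_buckets)

print("distinct values:", len(values))
print("buckets actually used:", len(used_buckets))
print("values that share a bucket with another value:", collided)
```

If collisions like these matter for your model, a larger hash_bucket_size reduces them, and tf.feature_column.categorical_column_with_vocabulary_list (as in the gender example in the question) avoids them by taking the list of unique values directly.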

Answer 2

I used your own answer. I just edited it slightly (inside the for loop it should be my_columns rather than my_column) and am posting it the way it worked for me.

import pandas.api.types as ptypes
import tensorflow as tf

my_columns = []

for col in df.columns:
  if ptypes.is_string_dtype(df[col]):  # is_string_dtype is a pandas function
    my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
        col, hash_bucket_size=len(df[col].unique())))

  elif ptypes.is_numeric_dtype(df[col]):  # is_numeric_dtype is a pandas function
    my_columns.append(tf.feature_column.numeric_column(col))
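
If you want to reuse or unit-test this loop, one option is to pull the type dispatch into a function and pass the column constructors in as arguments. The following is only a sketch under some assumptions: `build_columns`, `make_categorical` and `make_numeric` are hypothetical names, and the input is a plain dict of column name to values rather than a DataFrame, so that it runs without pandas or TensorFlow installed.

```python
def build_columns(columns, make_categorical, make_numeric):
    """Build one feature column per input column, chosen by value type.

    columns: dict mapping column name -> list of values (hypothetical
        stand-in for a DataFrame).
    make_categorical: callable(name, bucket_size) for string columns.
    make_numeric: callable(name) for int/float columns.
    Columns that are neither all-string nor all-numeric are skipped,
    mirroring the if/elif in the loop above.
    """
    result = []
    for name, values in columns.items():
        if all(isinstance(v, str) for v in values):
            # Same sizing rule as above: number of unique values.
            result.append(make_categorical(name, len(set(values))))
        elif all(isinstance(v, (int, float)) and not isinstance(v, bool)
                 for v in values):
            result.append(make_numeric(name))
    return result


# Stand-in constructors that just record what would be created.
specs = build_columns(
    {"gender": ["Female", "Male", "Female"], "age": [39, 50, 38]},
    make_categorical=lambda name, size: ("categorical", name, size),
    make_numeric=lambda name: ("numeric", name),
)
print(specs)
```

With TensorFlow available, you would presumably pass `lambda name, size: tf.feature_column.categorical_column_with_hash_bucket(name, hash_bucket_size=size)` as make_categorical and `tf.feature_column.numeric_column` as make_numeric.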

Answer 3

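The two approaches above work only if the data is provided in a pandas DataFrame, where you have a column name for each column. However, if all your columns are numeric and you don't want to name them, for example when reading several numeric columns from a NumPy array, you can use something like this: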

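feature_column = [tf.feature_column.numeric_column(key='image', shape=(784,))]

input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'image': x_train},  # one feature key holding all 784 columns
    shuffle=False)  # shuffle must be set explicitly; pass y=... to supply labels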

where x_train is a NumPy array with 784 columns. You can check Vikas Sangwan's post for more details.

Creating many feature columns in Tensorflow
