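Reading a CSV file and multi-hot encoding categorical variables in TensorFlow

Time: 2019-02-08 10:21:07

Tags: pandas tensorflow machine-learning data-science

I am reading data from a CSV file. When a feature is categorical with a single value per row, I can one-hot encode it with the following code.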

import tensorflow as tf
import tensorflow.feature_column as fc 
import pandas as pd
PATH = "/tmp/sample.csv"

tf.enable_eager_execution()

COLUMNS = ['education','label']
train_df = pd.read_csv(PATH, header=None, names = COLUMNS)

train_df['education'] = train_df['education'].str.split(" ").astype(str)
def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):
  label = df[label_key]
  #ed = tf.string_split(df['education']," ")
  #df['education'] = ed
  ds = tf.data.Dataset.from_tensor_slices((dict(df),label))
  if shuffle:
    ds = ds.shuffle(10000)
  ds = ds.batch(batch_size).repeat(num_epochs)
  return ds

ds = easy_input_function(train_df, label_key='label', num_epochs=5, shuffle=False, batch_size=5)


for feature_batch, label_batch in ds.take(1):
  print('Some feature keys:', list(feature_batch.keys())[:5])
  print()
  print('A batch of education  :', feature_batch['education'])
  print()
  print('A batch of Labels:', label_batch )
  print(feature_batch)

education_vocabulary_list = [
    'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
    'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
    '5th-6th', '10th', '1st-4th', 'Preschool', '12th']  
education = tf.feature_column.categorical_column_with_vocabulary_list('education', vocabulary_list=education_vocabulary_list)

fc.input_layer(feature_batch, [fc.indicator_column(education)])

My sample.csv file looks like this:

Bachelors,1
HS-grad,0

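However, when a categorical feature holds multiple values in one row, the code above fails to multi-hot encode the data.

Say my sample.csv looks like this: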

Bachelors HS-grad,1
HS-grad,0
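
A likely culprit can be shown in isolation. The sketch below uses only pandas and inlines the two rows above in place of /tmp/sample.csv. It suggests that `.str.split(" ").astype(str)` turns each list of tokens back into a single string such as `['Bachelors', 'HS-grad']`. If that string is what reaches the feature column, it would match no entry in the vocabulary list. The names `education` and `as_text` are illustrative only.

```python
import pandas as pd

# The education column from the two sample rows, inlined for the sketch.
education = pd.Series(["Bachelors HS-grad", "HS-grad"])

# Same transformation as in the question: split on spaces, then cast to str.
# The cast calls str() on each Python list, so the brackets and quotes
# become part of the resulting text.
as_text = education.str.split(" ").astype(str)

for value in as_text:
    print(repr(value))
```

Neither printed value is a plain vocabulary token, which may explain why no indicator is set for these rows.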

Does anyone know how the variable should be read from, or laid out in, the CSV file so that it can be multi-hot encoded in the model?
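
To make the target concrete, here is a minimal sketch of the multi-hot output, built outside TensorFlow with pandas `Series.str.get_dummies`. It is not a feature-column solution. The shortened `VOCAB` list and the inlined CSV text are assumptions made for illustration.

```python
import io

import pandas as pd

# The two sample rows, inlined so the sketch does not depend on /tmp/sample.csv.
CSV_TEXT = "Bachelors HS-grad,1\nHS-grad,0\n"
df = pd.read_csv(io.StringIO(CSV_TEXT), header=None, names=["education", "label"])

# Split each cell on spaces and build one 0/1 indicator column per distinct token.
multi_hot = df["education"].str.get_dummies(sep=" ")

# Align to a fixed vocabulary so column order and width do not depend on the data.
# VOCAB is a shortened, illustrative subset of the vocabulary list in the question.
VOCAB = ["Bachelors", "HS-grad", "Masters"]
multi_hot = multi_hot.reindex(columns=VOCAB, fill_value=0)

print(multi_hot)
```

In this sketch the first row gets a 1 under both Bachelors and HS-grad, and the second row only under HS-grad, which is the encoding I would like the model input to produce.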

0 answers:

No answers yet