Question

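I am trying to preprocess the (in)famous Titanic data (from Kaggle) by following this tutorial. Everything works until I run the titanic_preprocessing model on the data (titanic_features) and get this error: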

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).

The tutorial says the data should be converted to a dictionary of tensors, but:

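1. I don't see how the code (see the HERE1 marker in the code below) produces a dictionary of tensors (for example, there is no tf.convert_to_tensor).
2. I don't understand why all the data has to be converted again, since the preceding code should already do that (when creating preprocessed_inputs, etc.).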

Here is my code; you can also run it on Google Colab here.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing


url = "https://raw.githubusercontent.com/aymeric75/IA/master/train.csv"
titanic = pd.read_csv(url)


titanic_features = titanic.copy()
titanic_labels = titanic_features.pop('Survived')


inputs = {}

for name, column in titanic_features.items():
    dtype = column.dtype
    if dtype == object:
        dtype = tf.string
    else:
        dtype = tf.float32
    inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)

numeric_inputs = {name:input for name,input in inputs.items()
                  if input.dtype==tf.float32}

x = layers.Concatenate()(list(numeric_inputs.values()))
norm = preprocessing.Normalization()
norm.adapt(np.array(titanic[numeric_inputs.keys()]))

all_numeric_inputs = norm(x)
preprocessed_inputs = [all_numeric_inputs]


for name, input in inputs.items():
    if input.dtype == tf.float32:
        continue
    
    lookup = preprocessing.StringLookup(vocabulary=np.unique(titanic_features[name].dropna()))
    one_hot = preprocessing.CategoryEncoding(max_tokens=lookup.vocab_size())

    x = lookup(input)
    x = one_hot(x)
    preprocessed_inputs.append(x)


preprocessed_inputs_cat = layers.Concatenate()(preprocessed_inputs)
titanic_preprocessing = tf.keras.Model(inputs, preprocessed_inputs_cat)

titanic_features_dict = {}

# This model just contains the input preprocessing. You can run it to see what it does to your data.
# Keras models don't automatically convert Pandas DataFrames because
# it's not clear if it should be converted to one tensor or to a dictionary of tensors. So convert it to a dictionary of tensors:
# HERE1

titanic_features_dict = {name: np.array(value) 
                         for name, value in titanic_features.items()}

features_dict = {name:values[:1] for name, values in titanic_features_dict.items()}

titanic_preprocessing(features_dict)

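Thank you very much for your support!

Aymeric

[UPDATE] If you can answer question 2 ("I don't understand why all the data has to be converted again, since the preceding code should already do that (when creating preprocessed_inputs, etc.)"), I will accept your answer, because I think I do need to reformat the inputs (but then I don't see the point of all the preceding code...).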

Answer 1

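In your case, the problem is that your "Cabin" feature contains some NaN (not a number) values. TensorFlow is fine with NaN in numeric (float) columns, but not in string columns: pandas stores a missing string as a float NaN inside an object array, and that mixed array cannot be converted to a string tensor, hence "Unsupported object type float".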

You can replace all of these NaN values with an empty string in the pandas DataFrame:

titanic_features["Cabin"] = titanic_features["Cabin"].fillna("")
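The same issue would affect any other string column with missing values, not only "Cabin". As a minimal sketch, assuming a small made-up DataFrame (the values below are illustrative, not taken from the real dataset), the following shows that a missing string is stored as a float and how filling every object column removes the mixed types:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic frame; values are illustrative only.
df = pd.DataFrame({
    "Cabin": pd.Series(["C85", np.nan, "E46"], dtype=object),
    "Sex": pd.Series(["male", "female", "male"], dtype=object),
    "Age": pd.Series([22.0, np.nan, 35.0], dtype="float64"),
})

# In an object column a missing value is a Python float (NaN),
# so the resulting array mixes str and float elements.
cabin_types_before = {type(v).__name__ for v in np.array(df["Cabin"])}

# Fill missing values in every object (string) column, not only "Cabin".
# Numeric columns such as "Age" are left alone.
for name, column in df.items():
    if column.dtype == object:
        df[name] = column.fillna("")

cabin_types_after = {type(v).__name__ for v in np.array(df["Cabin"])}
print(cabin_types_before, cabin_types_after)
```

Whether an empty string is the right placeholder depends on how the vocabulary for the lookup layer is built; that choice is left open here.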

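The preceding code only declares the preprocessing function as a Keras model. No data is actually preprocessed until you call the titanic_preprocessing model, which is why the DataFrame still has to be converted into a dictionary of arrays matching the model's named inputs.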
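To make the split between declaration and execution concrete, here is a reduced sketch with two numeric inputs only. It assumes a recent TensorFlow release where Normalization is available as tf.keras.layers.Normalization (the question uses the older experimental path), and the sample values are made up for illustration:

```python
import numpy as np
import tensorflow as tf

# Made-up sample values for two numeric features (Age, Fare).
sample = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92]], dtype="float32")

# Declaration: symbolic inputs and layers. No feature rows are transformed here;
# adapt() only computes the mean and variance that the layer will use later.
inputs = {
    "Age": tf.keras.Input(shape=(1,), name="Age", dtype=tf.float32),
    "Fare": tf.keras.Input(shape=(1,), name="Fare", dtype=tf.float32),
}
x = tf.keras.layers.Concatenate()(list(inputs.values()))
norm = tf.keras.layers.Normalization()
norm.adapt(sample)
preprocessing_model = tf.keras.Model(inputs, norm(x))

# Execution: the model is called with a dictionary whose keys match the
# input names. Keras converts the NumPy arrays to tensors at this point.
features = {"Age": sample[:, :1], "Fare": sample[:, 1:]}
result = preprocessing_model(features)
print(np.asarray(result))
```

The conversion to tensors happens inside the call, which is why no explicit tf.convert_to_tensor appears in the tutorial code, and also why the error about the "Cabin" column only surfaces when the model is called.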

Passing a dictionary of tensors to a Keras model
