TensorForestEstimator with feature_columns raises TypeError

Asked: 2019-02-25 01:35:03

Tags: python csv numpy classification tflearn

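I plan to use TensorForestEstimator on a larger dataset, which will be fed through an input_fn that operates on Pandas objects.

To validate my understanding of the API, I put together a smaller example that uses a dataset from the UC Irvine Machine Learning Repository. The dataset has seven features (six int32 and one float32) and one label (int32).

I can run fit() and evaluate() just fine when the dataset is fed directly as numpy arrays through the x and y parameters.

When I try to do the same with data supplied through input_fn from tf.estimator.inputs.pandas_input_fn, and pass tf.contrib.layers feature columns to the feature_columns parameter, I get a TypeError raised in tensorflow/contrib/tensor_forest/python/ops/data_ops.py: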

TypeError: '<' not supported between instances of '_RealValuedColumn' and 'str'

This happens because sorted() is called on a list of dictionary keys that mixes str objects with TensorFlow feature-column objects.
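The failure mode can be reproduced without TensorFlow. The following is a minimal sketch, assuming only that the keys being sorted are a mix of strings and objects that define no ordering; `FakeColumn` and `try_sort` are hypothetical names made up for illustration and are not part of any TensorFlow API.

```python
class FakeColumn:
    """Hypothetical stand-in for a feature-column object (not a TensorFlow class)."""

    def __init__(self, name):
        self.name = name


def try_sort(keys):
    """Return the sorted keys, or the TypeError message if they cannot be ordered."""
    try:
        return sorted(keys)
    except TypeError as exc:
        return str(exc)


# A list mixing a plain string with an object that defines no '<' operator.
mixed_keys = ['sex', FakeColumn('age')]
message = try_sort(mixed_keys)
print(message)

# Keys that are all strings sort without complaint.
string_keys = try_sort(['sex', 'age', 'Time'])
print(string_keys)
```

In Python 3 there is no fallback ordering between unrelated types, so any code path that sorts such a mixed key list fails in this way.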

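The code, exported from a Jupyter notebook, is given at the end of this post.

Any insight into why this happens would be greatly appreciated. I have searched the documentation, StackOverflow, and the GitHub issue tracker extensively, but I have not been able to zero in on the root cause.

Thanks!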

Example code for TensorForestEstimator with pandas_input_fn

Python standard library and third-party imports

import csv
import numpy as np
import pandas as pd
import random

Aliased TensorFlow library imports

import tensorflow as tf
import tensorflow.contrib.layers as layers
import tensorflow.contrib.tensor_forest as tforest

TensorFlow library imports

from tensorflow.estimator.inputs import pandas_input_fn
from tensorflow.python.platform import tf_logging as logging

Metadata for CSV columns

COLUMN_PROPS = {
    'sex' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'sex',
            dtype=tf.int32
        )
    },
    'age' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'age',
            dtype=tf.int32
        )  
    },
    'Time' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.float32,
        'default' : -1.0,
        'feature_column' : layers.real_valued_column(
            'Time',
            dtype=tf.float32
        )
    },
    'Number_of_Warts' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'Number_of_Warts',
            dtype=tf.int32
        ),
    },
    'Type' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'Type',
            dtype=tf.int32
        )
    },
    'Area' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'Area',
            dtype=tf.int32
        )
    },
    'induration_diameter' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype': tf.int32,
        'default': -1,
        'feature_column' : layers.real_valued_column(
            'induration_diameter',
            dtype=tf.int32
        )
    },
    'Result_of_Treatment': {
        'is_feature' : False,
        'is_label' : True,
        'dtype': tf.int32,
        'default': -1,
        'feature_column' : None
    }
}

Ordering of CSV columns

CSV_COLUMNS = [
    'sex',
    'age',
    'Time',
    'Number_of_Warts',
    'Type',
    'Area',
    'induration_diameter',
    'Result_of_Treatment'
]

Generate feature and label lists from metadata

FEATURE_COLUMNS = []
LABEL_COLUMN = None

for k in CSV_COLUMNS:
    if COLUMN_PROPS[k]['is_feature']:
        FEATURE_COLUMNS.append(k)
    elif COLUMN_PROPS[k]['is_label']:
        LABEL_COLUMN = k
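The derivation of the feature and label lists can be checked on its own, without TensorFlow. This is a minimal sketch, assuming a trimmed-down copy of the metadata that keeps only the `is_feature` and `is_label` flags; the lowercase names are hypothetical and chosen so they do not clash with the notebook's own variables.

```python
# Trimmed-down, hypothetical copy of the column metadata (flags only).
column_props = {
    'sex': {'is_feature': True, 'is_label': False},
    'Time': {'is_feature': True, 'is_label': False},
    'Result_of_Treatment': {'is_feature': False, 'is_label': True},
}

# The list fixes the column order; the dict only supplies per-column flags.
csv_columns = ['sex', 'Time', 'Result_of_Treatment']

feature_columns = [k for k in csv_columns if column_props[k]['is_feature']]
label_columns = [k for k in csv_columns if column_props[k]['is_label']]

# Exactly one label column is expected in this dataset.
if len(label_columns) != 1:
    raise ValueError('expected exactly one label column')
label_column = label_columns[0]

print(feature_columns, label_column)
```

Iterating over the ordered list rather than the dictionary keeps the feature order tied to the CSV layout.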

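Helper function for shuffling and exporting subsets

This function exports the training, evaluation, and test datasets as CSV files, shuffling the rows of each.

def generate_sets(datasets):
    # Relies on the module-level `header` row read from the source CSV.
    for k, v in datasets.items():
        # Shuffle the records in place before writing them out.
        random.shuffle(v)
        # newline='' stops the csv module from emitting blank rows on Windows.
        with open(k + '.csv', 'w', newline='') as fobj:
            wrtr = csv.writer(fobj)
            wrtr.writerow(header)
            for rec in v:
                wrtr.writerow(rec)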

Split the dataset for training, evaluation, and testing

trn = []
evl = []
tst = []

with open('Immunotherapy - ImmunoDataset.csv', 'r') as fobj:
    rdr = csv.reader(fobj)
    header = next(rdr)
    label_key = header[-1]
    feature_keys = header[:-1]
    for rec in rdr:
        # The output of the random number generator determines
        # which set the record is placed in.
        rn = random.random()
        if rn < 0.6:
            trn.append(rec)
        elif rn < 0.8:
            evl.append(rec)
        else:
            tst.append(rec)

datasets = {
    'train' : trn,
    'eval' : evl,
    'test' : tst
}

generate_sets(datasets)
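Because the split draws from the global random number generator, each run assigns records differently. The following is a minimal sketch of a repeatable variant, assuming a private `random.Random` instance with a fixed seed; `split_records`, its parameters, and the sample records are hypothetical and not part of the original notebook.

```python
import random


def split_records(records, seed=0, train_frac=0.6, eval_frac=0.2):
    """Assign each record to train, eval, or test using a seeded generator."""
    rng = random.Random(seed)  # private generator, independent of global state
    train, evaluation, test = [], [], []
    for rec in records:
        rn = rng.random()
        if rn < train_frac:
            train.append(rec)
        elif rn < train_frac + eval_frac:
            evaluation.append(rec)
        else:
            test.append(rec)
    return train, evaluation, test


# Made-up records standing in for the CSV rows.
records = [[str(i)] for i in range(90)]

first = split_records(records, seed=42)
second = split_records(records, seed=42)

print([len(part) for part in first])
```

With the same seed, both calls place every record in the same subset, which makes a failing run easier to reproduce.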

Set hyperparameters for TensorForest

fhp = tforest.tensor_forest.ForestHParams(
    num_classes=2,
    num_features=7,
    regression=False
)

Pick feature columns from the metadata dictionary

fcs = [COLUMN_PROPS[k]['feature_column'] for k in FEATURE_COLUMNS]

Instantiate the TensorForestEstimator object

tfe = tforest.random_forest.TensorForestEstimator(
    fhp,
    feature_columns=fcs,
    report_feature_importances=True
)

Define a wrapper for pandas_input_fn, used to feed the training data

def get_input_fn(csv_file):

    df = pd.read_csv(csv_file)

    features = df.loc[:,'sex':'induration_diameter']

    # Workaround for this issue:
    #
    # https://stackoverflow.com/questions/48577372/tensorflowusing-pandas-input-fn-with-tensorforestestimator
    # https://github.com/tensorflow/tensorflow/issues/16692

    labels = pd.DataFrame(
        np.expand_dims(
            df.loc[:,'Result_of_Treatment'].values, axis=1
        )
    )

    return pandas_input_fn(x=features, y=labels, shuffle=False)
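The workaround in the wrapper reshapes the label column from a 1-D array into a single-column DataFrame. This is a minimal sketch of that reshaping step alone, using made-up label values rather than the real dataset; it assumes numpy and pandas are installed and does not involve TensorFlow.

```python
import numpy as np
import pandas as pd

# Made-up label values standing in for the Result_of_Treatment column.
label_series = pd.Series([1, 0, 1], name='Result_of_Treatment')

flat = label_series.values             # 1-D array of labels
column = np.expand_dims(flat, axis=1)  # 2-D array with a single column
labels = pd.DataFrame(column)          # one-column frame, as built in the wrapper

print(flat.shape, column.shape, labels.shape)
```

The label values themselves are unchanged; only the shape differs.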

1 answer:

Answer 0 (score: 0)

After further testing, I believe this is a bug in TensorForestEstimator. More details can be found in the GitHub issue at the following URL:

https://github.com/tensorflow/tensorflow/issues/26082