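I plan to use TensorForestEstimator on a larger dataset, which will be fed through an input_fn that operates on Pandas objects.
To validate my understanding of the API, I put together a smaller example that uses a dataset from the UC Irvine Machine Learning Repository. The dataset has seven features (six int32 and one float32) and one label (int32).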
When the dataset is fed directly as numpy arrays through the x and y parameters, fit() and evaluate() run fine.
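When I try the same operations with data supplied through an input_fn built with pandas_input_fn from tf.estimator.inputs, and pass tf.contrib.layers feature columns to the feature_columns parameter, I get the following error from tensorflow/contrib/tensor_forest/python/ops/data_ops.py: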
TypeError: '<' not supported between instances of '_RealValuedColumn' and 'str'
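This happens because sorted() is called on a list of dictionary keys that contains both str and TensorFlow objects (here, _RealValuedColumn instances).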
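The failure mode described above can be reproduced without TensorFlow. The following is a minimal sketch, not the actual code in data_ops.py: FakeColumn is a hypothetical stand-in for a feature column object, and try_sort_keys is a helper introduced only for illustration. It assumes Python 3, where values of unrelated types cannot be ordered with `<`.

```python
class FakeColumn:
    """Hypothetical stand-in for a feature column object such as _RealValuedColumn."""

    def __init__(self, name):
        self.name = name


def try_sort_keys(keys):
    """Return the sorted keys, or the TypeError message if they cannot be ordered."""
    try:
        return sorted(keys)
    except TypeError as exc:
        return "TypeError: {}".format(exc)


# Keys that are all strings sort without trouble.
string_keys = ["sex", "age", "Time"]

# Mixing strings with arbitrary objects makes sorted() fail in Python 3,
# because FakeColumn defines no ordering relative to str.
mixed_keys = ["sex", FakeColumn("age")]

print(try_sort_keys(string_keys))
print(try_sort_keys(mixed_keys))
```

If this is what happens inside the estimator, a features dictionary keyed consistently by strings would avoid the comparison, but whether that is practical depends on how the estimator builds the dictionary.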
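The code, exported from a Jupyter notebook, is given at the end of this post.
Any insight into why this happens would be greatly appreciated. I have searched the documentation, StackOverflow, and the GitHub issues quite a bit, but have not zeroed in on the root cause.
Thanks!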
Example code for TensorForestEstimator with pandas_input_fn
import csv
import numpy as np
import pandas as pd
import random
import tensorflow as tf
import tensorflow.contrib.layers as layers
import tensorflow.contrib.tensor_forest as tforest
from tensorflow.estimator.inputs import pandas_input_fn
from tensorflow.python.platform import tf_logging as logging
COLUMN_PROPS = {
    'sex' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'sex',
            dtype=tf.int32
        )
    },
    'age' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'age',
            dtype=tf.int32
        )
    },
    'Time' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.float32,
        'default' : -1.0,
        'feature_column' : layers.real_valued_column(
            'Time',
            dtype=tf.float32
        )
    },
    'Number_of_Warts' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'Number_of_Warts',
            dtype=tf.int32
        ),
    },
    'Type' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'Type',
            dtype=tf.int32
        )
    },
    'Area' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'Area',
            dtype=tf.int32
        )
    },
    'induration_diameter' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype': tf.int32,
        'default': -1,
        'feature_column' : layers.real_valued_column(
            'induration_diameter',
            dtype=tf.int32
        )
    },
    'Result_of_Treatment': {
        'is_feature' : False,
        'is_label' : True,
        'dtype': tf.int32,
        'default': -1,
        'feature_column' : None
    }
}
CSV_COLUMNS = [
    'sex',
    'age',
    'Time',
    'Number_of_Warts',
    'Type',
    'Area',
    'induration_diameter',
    'Result_of_Treatment'
]
The generate_sets function below exports the training, evaluation, and test datasets to CSV, shuffling the rows of each set.
FEATURE_COLUMNS = []
LABEL_COLUMN = None
for k in CSV_COLUMNS:
    if COLUMN_PROPS[k]['is_feature']:
        FEATURE_COLUMNS.append(k)
    elif COLUMN_PROPS[k]['is_label']:
        LABEL_COLUMN = k
def generate_sets(datasets):
    for k, v in datasets.items():
        random.shuffle(v)
        with open(k + '.csv', 'w') as fobj:
            wrtr = csv.writer(fobj)
            wrtr.writerow(header)
            for rec in v:
                wrtr.writerow(rec)
trn = []
evl = []
tst = []
with open('Immunotherapy - ImmunoDataset.csv', 'r') as fobj:
    rdr = csv.reader(fobj)
    header = next(rdr)
    label_key = header[-1]
    feature_keys = header[:-1]
    for rec in rdr:
        # Output of random number generator determines
        # which set the record will be placed.
        rn = random.random()
        if rn < 0.6:
            trn.append(rec)
        elif rn < 0.8:
            evl.append(rec)
        else:
            tst.append(rec)
datasets = {
    'train' : trn,
    'eval' : evl,
    'test' : tst
}
generate_sets(datasets)
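The threshold-based split above assigns each record independently, so the resulting sets are only approximately 60/20/20 and change on every run. The following is a minimal sketch of the same idea with a seeded generator so the split is reproducible. The split_records function, its parameters, and the toy records are hypothetical and not part of the original notebook.

```python
import random


def split_records(records, train_frac=0.6, eval_frac=0.2, seed=None):
    """Assign each record to train, eval, or test by drawing one random number per record."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    splits = {'train': [], 'eval': [], 'test': []}
    for rec in records:
        rn = rng.random()
        if rn < train_frac:
            splits['train'].append(rec)
        elif rn < train_frac + eval_frac:
            splits['eval'].append(rec)
        else:
            splits['test'].append(rec)
    return splits


# Toy records standing in for the CSV rows (hypothetical values).
records = [[str(i), str(i % 2)] for i in range(90)]
splits = split_records(records, seed=42)

# The set sizes are approximate, not exact fractions of the input.
print({name: len(rows) for name, rows in splits.items()})
```

Fixing the seed makes it easier to compare the numpy-array run against the input_fn run on identical data.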
Hyperparameters for TensorForest
fhp = tforest.tensor_forest.ForestHParams(
    num_classes=2,
    num_features=7,
    regression=False
)
fcs = [COLUMN_PROPS[k]['feature_column'] for k in FEATURE_COLUMNS]
TensorForestEstimator object
tfe = tforest.random_forest.TensorForestEstimator(
    fhp,
    feature_columns=fcs,
    report_feature_importances=True
)
Define a wrapper for pandas_input_fn
def get_input_fn(csv_file):
    df = pd.read_csv(csv_file)
    features = df.loc[:,'sex':'induration_diameter']
    # Workaround for this issue:
    #
    # https://stackoverflow.com/questions/48577372/tensorflowusing-pandas-input-fn-with-tensorforestestimator
    # https://github.com/tensorflow/tensorflow/issues/16692
    labels = pd.DataFrame(
        np.expand_dims(
            df.loc[:,'Result_of_Treatment'].values, axis=1
        )
    )
    return pandas_input_fn(x=features, y=labels, shuffle=False)
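The workaround in get_input_fn changes only the shape of the labels: a one-dimensional column of values becomes a single-column, two-dimensional DataFrame. The following is a minimal sketch of that reshaping on a toy frame. The values are hypothetical, and the sketch does not show why the estimator needs this shape, which is discussed in the linked issue.

```python
import numpy as np
import pandas as pd

# Toy frame with one feature and the label column (hypothetical values).
df = pd.DataFrame({
    'Area': [5, 51, 100],
    'Result_of_Treatment': [1, 1, 0],
})

# Selecting one column and taking .values yields a 1-D array of shape (n,).
flat = df.loc[:, 'Result_of_Treatment'].values

# expand_dims with axis=1 adds a trailing axis, giving shape (n, 1).
column = np.expand_dims(flat, axis=1)

# Wrapping the 2-D array gives a DataFrame with a single column named 0.
labels = pd.DataFrame(column)

print(flat.shape, column.shape, labels.shape)
```

Note that the resulting DataFrame loses the original column name, since it is rebuilt from a bare array.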
Answer 0 (score: 0)
After further testing, I believe this is a bug in TensorForestEstimator. More details can be found in the GitHub issue at the following URL: