I'm new to both Python (v3.7.4) and TensorFlow (v1.15), so please forgive me if I don't explain myself properly.
I'm trying to train and evaluate a TensorFlow LinearClassifier on a database that looks like this:
The problem is a binary classification; the label to predict is the last column, 'CHB'; the other columns are the "features" (not all of them are actually used in the simulation).
I split the original database into a 'train' and a 'test' database, with roughly 3/4 of the original rows going to 'train'. The 'train' database has 7048 rows and the 'test' database has 2350 rows.
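The split was done along these lines (a minimal sketch, assuming a random pandas split; the fraction and the seed are illustrative, not necessarily what I actually used):

df = pd.read_csv('path_to_original_csv_file')     # the original database
df_train = df.sample(frac=0.75, random_state=42)  # ~3/4 of the rows for 'train'
df_test = df.drop(df_train.index)                 # the remaining rows for 'test'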
Among the "feature" columns, I selected ['TSTA', 'SO', 'ATSTA', 'ATD', 'VAL', 'UNI', 'CHD'].
我还添加了3个交叉列:['VAL x UNI'; 'TSTA x ATSTA x ATD', 'SO x ID']
I trained the default LinearClassifier on the train database (2210 steps) and evaluated it on the test database.
These are the metrics obtained from the evaluation:
accuracy: 0.97148937;
accuracy_baseline: 0.65446806;
auc: 0.9604878;
auc_precision_recall: 0.9801797;
average_loss: 19186.854;
global_step: 2210;
label/mean: 0.65446806;
loss: 609312.25;
precision: 0.9617075;
prediction/mean: 0.67787236;
recall: 0.9960988
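(Doing the math on these numbers: 19186.854 × 32 ≈ 613,979, so the reported loss looks like average_loss summed over a batch of 32, give or take the smaller final batch. My understanding is that TF 1.x's LinearClassifier defaults to a SUM loss reduction, so loss is per batch while average_loss is per example; but even ~19,187 per example seems enormous.)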
I'm worried about the loss (609312.25): I have never seen such a high loss, and I'd like to know why it is so large even though all the other metrics look "normal".
Any idea why this happens? I also tried another optimizer with a lower learning rate, but I saw no improvement.
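This is roughly what I tried (a sketch; Adam is just an example of "another optimizer", and the learning-rate value is illustrative):

classifier = tf.estimator.LinearClassifier(
    feature_columns=my_categorical_columns + my_crossed_columns + my_numerical_columns,
    optimizer=tf.train.AdamOptimizer(learning_rate=0.001))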
This is my code:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import pandas as pd
import tensorflow.feature_column as fc
import numpy
tf.enable_eager_execution()
# Path to the .csv files:
train_file = 'path_to_train_csv_file'
test_file = 'path_to_test_csv_file'
# Open the files as pandas DataFrames:
df_train = pd.read_csv(train_file)
df_test = pd.read_csv(test_file)
# Define the columns Index:
COLUMNS = ["TSTR", "TSTA", "SO", "ID", "TID", "ATSTR", "ATSTA", "ATD", "VAL", "UNI", "CHD", "CHB"]
df_train.columns = COLUMNS
df_test.columns = COLUMNS
label = 'CHB' # define the label
# Input function to create the TF DataSet:
def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):
    label = df[label_key]
    ds = tf.data.Dataset.from_tensor_slices((dict(df), label))
    if shuffle:
        ds = ds.shuffle(1000)  # shuffle buffer of 1000 rows
    # batching before repeating: each epoch yields ceil(len(df)/batch_size) batches
    ds = ds.batch(batch_size).repeat(num_epochs)
    return ds
# Define the Feature Columns:
TSTA = fc.numeric_column("TSTA") # time stamps (actual) numeric column
SO = fc.categorical_column_with_hash_bucket("SO", hash_bucket_size=2000) # Source categorical column
id_set_train = set(df_train.ID)
id_set_test = set(df_test.ID)
id_list = list(id_set_test | id_set_train) # this list has all the unique ID strings in the databases
ID = fc.categorical_column_with_vocabulary_list('ID', id_list) # ID categorical column based on the id_list values
tid_set_train = set(df_train.TID)
tid_set_test = set(df_test.TID)
tid_list = list(tid_set_test | tid_set_train) # this list has all the unique TID strings in the databases
TID = fc.categorical_column_with_vocabulary_list("TID", tid_list) # TID categorical column based on the tid_list values
ATSTA = fc.numeric_column("ATSTA") # time stamps (active) numeric column
ATD = fc.numeric_column("ATD") # time deltas in seconds (actual - active)
VAL = fc.numeric_column("VAL") # PV value
unit_set_train = set(df_train.UNI)
unit_set_test = set(df_test.UNI)
unit_list = list(unit_set_test | unit_set_train) # this list has all the unique units of measure in the databases
UNI = fc.categorical_column_with_vocabulary_list("UNI", unit_list) # UNI categorical column based on the unit_list values
CHD = fc.numeric_column("CHD") # chattering index time deltas (past)
crossed = [fc.crossed_column(["VAL", "UNI"], hash_bucket_size=int(1e6)),
           fc.crossed_column(["TSTA", "ATSTA", "ATD"], hash_bucket_size=int(1e6)),
           fc.crossed_column(["SO", "ID"], hash_bucket_size=int(1e6))]
my_numerical_columns = [TSTA, ATSTA, ATD, VAL, CHD]
my_categorical_columns = [SO, ID, UNI]
my_crossed_columns = crossed
classifier = tf.estimator.LinearClassifier(feature_columns=my_categorical_columns + my_crossed_columns + my_numerical_columns)
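# NOTE: everything is left at its default here; if I read the TF 1.x docs
# correctly, LinearClassifier's default optimizer is FTRL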
classifier.train(
    steps=2300,
    input_fn=lambda: easy_input_function(df_train, label_key=label, num_epochs=10, shuffle=True, batch_size=32))
result = classifier.evaluate(input_fn=lambda: easy_input_function(df_test, label_key=label, num_epochs=1,
                                                                  shuffle=False, batch_size=32))
for key, value in sorted(result.items()):
    print('%s: %s' % (key, value))
print("")
I'm also attaching a screenshot (from TensorBoard) of the loss trend during training.
Any idea about what is going on here?
PS: Could you also explain why the training stopped after 2210 steps? I know it has something to do with epochs, batch size and shuffling. Could you (briefly) explain how these parameters affect the result, and what would be the correct epoch, buffer and shuffle values for my problem?
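For what it's worth, here is the arithmetic that I suspect explains the 2210 steps (my assumption being that the dataset built by easy_input_function is simply exhausted before the 2300 steps I requested):

import math
rows, batch_size, num_epochs = 7048, 32, 10
batches_per_epoch = math.ceil(rows / batch_size)  # 221 (the last batch is partial)
print(batches_per_epoch * num_epochs)             # 221 * 10 = 2210 -> the input
                                                  # runs out before step 2300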