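I am working with DNNClassifier from the TensorFlow Estimator API, and the dataset I am using is JM1 (defect prediction).
Training rows used: 0:8163 (defect-free: 6056, defective: 2106)
Validation rows used: 8163:9796 (defect-free: 1634, defective: 0)
The remaining rows are used only for testing. There are 10885 rows in total.
The evaluation metrics I get on the validation set are: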
'accuracy': 0.97917944,
'accuracy_baseline': 1.0,
'auc': 1.0,
'auc_precision_recall': 0.0,
'average_loss': 0.27983573,
'label/mean': 0.0,
'loss': 35.151672,
'precision': 0.0,
'prediction/mean': 0.22930107,
'recall': 0.0,
'global_step': 332261
Because the dataset is imbalanced, I get a precision and recall of 0.
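The metrics above can be checked by hand. `label/mean` is 0.0 and `accuracy_baseline` is 1.0, which matches a validation slice with no defective rows at all. With no positives, the true-positive count is always 0, so precision is 0 whenever the model predicts any positive, and recall is 0/0, which appears to be reported as 0. Below is a minimal sketch of that arithmetic. The function name `binary_metrics` is hypothetical, and the counts of 1599 true negatives and 34 false positives are an assumption inferred from the reported accuracy, not figures given in the post.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall), using 0.0 where a ratio is undefined."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Assumed counts: the slice has no positives, so tp = fn = 0. The 34 false
# positives are inferred from the reported accuracy, not stated in the post.
accuracy, precision, recall = binary_metrics(tp=0, fp=34, tn=1599, fn=0)
print(accuracy, precision, recall)
```

Under these assumed counts, precision and recall stay at 0 however well the model is trained, so the validation slice needs some defective rows before those two metrics can say anything.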
My code is attached below. Can anyone suggest how to handle the imbalanced dataset, or otherwise point out what in my code causes this?
import tensorflow as tf
import numpy as np
import pandas as pd
import os
import shutil
dataset = pd.read_csv('jm_missing_removed.csv')
dataset = dataset.iloc[:,0:22]
CSV_COLUMNS = ['loc','vg','evg','ivg','n','v','l','d','i','e','b','t','lOCode','lOComment','lOBlank','locCodeAndComment','uniq_Op','uniq_Opnd','total_Op','total_Opnd','branchCount','defects'
]
FEATURES = CSV_COLUMNS[0:len(CSV_COLUMNS) - 1]
LABEL = CSV_COLUMNS[21]
def make_feature_cols():
    input_columns = [tf.feature_column.numeric_column(k) for k in FEATURES]
    return input_columns
feature_columns = make_feature_cols()
feature_columns
tf.logging.set_verbosity(tf.logging.INFO)
# To save the trained model
OUTDIR = './logs/breastCancer_trained'
shutil.rmtree(OUTDIR, ignore_errors = True)
myopt = tf.train.FtrlOptimizer(learning_rate = 0.01)
model = tf.estimator.DNNClassifier(feature_columns = make_feature_cols(),
                                   model_dir = OUTDIR, hidden_units=[10, 10],
                                   n_classes=2, optimizer = myopt,
                                   activation_fn = tf.nn.relu)
def make_input_fn(df, num_epochs):
    return tf.estimator.inputs.pandas_input_fn(
        x = df,
        y = df[LABEL],
        num_epochs = num_epochs,
        shuffle = True,
        num_threads = 1
    )
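# Row slices as described in the question: training 0:8163, validation 8163:9796
df_train = dataset.iloc[0:8163]
df_eval = dataset.iloc[8163:9796]
model.train(input_fn = make_input_fn(df_train, num_epochs = 10))
ev = model.evaluate(input_fn = make_input_fn(df_eval, num_epochs = 1))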
Any simpler solution would be much appreciated.
Answer 0 (score: 0)
Using k-fold cross-validation together with oversampling via ADASYN should give better results.
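The suggestion can be sketched as follows: split the rows into stratified folds so every fold contains both classes, and oversample the minority class only inside each training fold so the evaluation rows stay untouched. This is a dependency-free illustration, not the answerer's code. It swaps in plain random duplication where ADASYN would synthesize new minority points, and the helper names `stratified_kfold` and `oversample_minority` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs with class ratios roughly preserved per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


def oversample_minority(train_idx, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        balanced.extend(rng.choice(indices) for _ in range(target - len(indices)))
    rng.shuffle(balanced)
    return balanced


# Toy labels with roughly the 3:1 class ratio of the training slice in the question.
labels = [0] * 60 + [1] * 20
for train_idx, test_idx in stratified_kfold(labels, k=5):
    balanced_idx = oversample_minority(train_idx, labels)
    # Train on the rows in balanced_idx, evaluate on the untouched rows in test_idx.
```

With real data, the imbalanced-learn package provides ADASYN as `imblearn.over_sampling.ADASYN` (used through `fit_resample`), and scikit-learn provides `StratifiedKFold`; either can replace the corresponding helper here. The resampled index list would then select the DataFrame rows passed to `make_input_fn`.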