Question

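I trained a gradient boosted classifier using the TF example code at https://www.tensorflow.org/tutorials/estimators/boosted_trees_model_understanding

However, the TF estimator gradient boosted classifier suddenly stops during training.

I think it runs for several steps at the beginning and then suddenly stops, without printing any exception.

How can I find out why Python crashed?

It is very hard to find the reason for the stop.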

Environment:

lib : TF-gpu 1.13.1
cuda : 10.0
cudnn : 7.5

Log:

2019-04-15 16:40:26.175889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7845 pciBusID: 0000:07:00.0 totalMemory: 6.00GiB freeMemory: 4.97GiB
2019-04-15 16:40:26.182620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-15 16:40:26.832040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-15 16:40:26.835620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-15 16:40:26.836840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-15 16:40:26.838276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4716 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:07:00.0, compute capability: 6.1)
WARNING:tensorflow:From D:\python\lib\site-packages\tensorflow\python\training\saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file APIs to check for files with this prefix.
WARNING:tensorflow:From D:\python\lib\site-packages\tensorflow\python\training\saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes.
WARNING:tensorflow:Issue encountered when serializing resources. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore. '_Resource' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing resources. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore. '_Resource' object has no attribute 'name'

D:\py> (training just ended here)

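    import pandas as pd
    import tensorflow as tf
    # train_test_split is assumed to come from scikit-learn.
    from sklearn.model_selection import train_test_split

    trn = pd.read_csv('data/santander-customer-transaction-prediction/train.csv')
    tst = pd.read_csv('data/santander-customer-transaction-prediction/test.csv')

    # Optional class rebalancing (disabled).
    #trn = upsample(trn[trn.target==0], trn[trn.target==1])
    #trn = downsample(trn[trn.target==0], trn[trn.target==1])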


    features = trn.columns.values[2:202]
    target_name = trn.columns.values[1]
    train=trn[features]
    target=trn[target_name]

    NUM_EXAMPLES = len (target)
    print (NUM_EXAMPLES)

    feat1 = train.corrwith(target).sort_values().head(20).index
    feat2 = train.corrwith(target).sort_values().tail(20).index
    featonly = feat1.append(feat2)
    feat = featonly.append(pd.Index(['target']))

    train_origin, tt = train_test_split(trn, test_size=0.2)

    train = train_origin[featonly]
    target = train_origin[target_name]
    test = tst[featonly]

    target_name_tst = tst.columns.values[1]
    target_tst=tst[target_name_tst]

    val_origin=tt
    val_train = tt[featonly]
    val_target = tt[target_name]
    # Training and evaluation input functions.

    train_input_fn = make_input_fn(train, target)
    val_input_fn = make_input_fn(val_train, val_target)

    ttt=tf.estimator.inputs.pandas_input_fn(x=test,num_epochs=1,shuffle=False)


    del train,target,val_train,train_origin,trn,tst

    fc = tf.feature_column
    feature_columns = []
    for feature_name in featonly:
        feature_columns.append(fc.numeric_column(feature_name,dtype=tf.float32))
    #feature_columns



    #5
    #tf.logging.set_verbosity(tf.logging.INFO)
    #logging_hook = tf.train.LoggingTensorHook({"loss" : loss, "accuracy" : accuracy}, every_n_iter=10)

    params = {
      'n_trees': 50,
      'max_depth': 3,
      'n_batches_per_layer': 1,
      # You must enable center_bias = True to get DFCs. This will force the model to 
      # make an initial prediction before using any features (e.g. use the mean of 
      # the training labels for regression or log odds for classification when
      # using cross entropy loss).
      'center_bias': True
    }
    #config = tf.estimator.RunConfig().replace(keep_checkpoint_max = 1,
    #                   log_step_count_steps=20, save_checkpoints_steps=20)

    # Forward slashes avoid the invalid '\p' escape in the Windows path.
    est = tf.estimator.BoostedTreesClassifier(feature_columns, **params, model_dir='d:/py/model/')
    est.train(train_input_fn, max_steps=50)

------------------------------------------- stopped here

    metrics = est.evaluate(input_fn=val_input_fn,steps=1)

    results = est.predict(input_fn=ttt )
    result_list = list(results)


    classi = list(map(lambda x : x['classes'][0].decode("utf-8"), result_list))
    num = list(range(0,len(classi)))
    numi = list(map(lambda x : 'test_' + str(x),num))
    #df1 = pd.DataFrame(columns=('ID_code','target'))

    df_result = pd.DataFrame({'ID_code' : numi, 'target' : classi})

    df_result.to_csv('result/submission03.csv',index=False)

    def make_input_fn(X, y, n_epochs=None, shuffle=True):
        def input_fn():
            NUM_EXAMPLES = len(y)
            dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
            #dataset = tf.data.Dataset.from_tensor_slices((X.to_dict(orient='list'), y))
            #if shuffle:
            #    dataset = dataset.shuffle(NUM_EXAMPLES)
            # For training, cycle through the dataset as many times as needed (n_epochs=None).
            # The whole dataset is used as a single batch.
            dataset = dataset.repeat(n_epochs).batch(NUM_EXAMPLES)
            return dataset
        return input_fn

It should show the evaluation results.

Answer 1

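I think the problem is caused by running out of GPU memory. You can try increasing 'n_batches_per_layer' to a larger value, depending on your GPU memory size. I use a 6 GB GPU, and the value is 16.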
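
As a rough sketch of what that could look like: the tutorial linked in the question treats the whole dataset as a single batch (with 'n_batches_per_layer' set to 1), so raising 'n_batches_per_layer' presumably goes together with a smaller batch in input_fn. The helper name batch_size_for and the example row count below are hypothetical and not taken from the question. This is an assumption about how the two settings relate, not a confirmed fix.

```python
import math


def batch_size_for(num_examples, n_batches_per_layer):
    """Rows per batch so that n_batches_per_layer batches cover the data once."""
    if n_batches_per_layer < 1:
        raise ValueError("n_batches_per_layer must be at least 1")
    return math.ceil(num_examples / n_batches_per_layer)


NUM_EXAMPLES = 160000      # hypothetical training-set size, for illustration only
N_BATCHES_PER_LAYER = 16   # value the answer reports using on a 6 GB GPU

BATCH_SIZE = batch_size_for(NUM_EXAMPLES, N_BATCHES_PER_LAYER)
print(BATCH_SIZE)

# In the question's code these values would then be used roughly as:
#   params['n_batches_per_layer'] = N_BATCHES_PER_LAYER
#   dataset = dataset.repeat(n_epochs).batch(BATCH_SIZE)
```

Smaller batches mean less data is held on the GPU at once, which is why this might help if memory is the cause.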

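
On the question of finding out why Python exits silently: one option that may help is the standard-library faulthandler module, which dumps the Python traceback if the process dies from a fatal native error (for example a segmentation fault). This is only a sketch. The log file name is a placeholder, and it will not necessarily capture every kind of GPU failure.

```python
import faulthandler
import os
import tempfile

# Placeholder location for the crash log; any writable path works.
CRASH_LOG_PATH = os.path.join(tempfile.gettempdir(), "tf_crash_traceback.log")

# Keep the file object open for the lifetime of the process:
# faulthandler writes to its file descriptor when a fatal error occurs.
crash_log = open(CRASH_LOG_PATH, "w")
faulthandler.enable(file=crash_log, all_threads=True)

# ... build the estimator and call est.train(...) after this point ...
# Un-commenting tf.logging.set_verbosity(tf.logging.INFO) from the
# question's code would also print training progress (TF 1.x).
```

If the process dies from such an error, the traceback should be in the log file and show which Python call was running at the time.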