我正在运行提交 ml引擎培训的tensorflow模型。我已经构建了一个管道,它使用 tf.contrib.cloud.python.ops.bigquery_reader_ops.BigQueryReade 作为读者,从 BigQuery 读取的队列即可。
使用 DataLab 和本地,一切正常,设置 GOOGLE_APPLICATION_CREDENTIALS 变量指向凭据密钥的json文件。但是,当我在云端提交培训工作时,我会收到这些错误(我只发布了两个主要错误):
权限被拒绝:在阅读...的架构时执行HTTP请求时出错(HTTP响应代码403,错误代码0,错误消息'')
创建模型时出错。检查详细信息:请求的身份验证范围不足。
我已经检查了其他所有内容,比如在脚本和project / dataset / table ids / names中正确定义表模式
我在这里粘贴了日志中存在的整个错误,以便更清晰:
消息:“Traceback(最近一次呼叫最后一次):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 131, in <module>
hparams=hparam.HParams(**args.__dict__)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 210, in run
return _execute_schedule(experiment, schedule)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 47, in _execute_schedule
return task()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate
self.train(delay_secs=0)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
hooks=self._train_monitors + extra_hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
monitors=hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1007, in _train_model
_, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 521, in __exit__
self._close_internal(exception_type)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 556, in _close_internal
self._sess.close()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 791, in close
self._sess.close()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 888, in close
ignore_live_threads=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
enqueue_callable()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1063, in _single_operation_run
target_list_as_strings, status, None)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
PermissionDeniedError: Error executing an HTTP request (HTTP response code 403, error code 0, error message '')
when reading schema for pasquinelli-bigdata:Transactions.t_11_Hotel_25_w_train@1505224768418
[[Node: GenerateBigQueryReaderPartitions = GenerateBigQueryReaderPartitions[columns=["F_RACC_GEST", "LABEL", "F_RCA", "W24", "ETA", "W22", "W23", "W20", "W21", "F_LEASING", "W2", "W16", "WLABEL", "SEX", "F_PIVA", "F_MUTUO", "Id_client", "F_ASS_VITA", "F_ASS_DANNI", "W19", "W18", "W17", "PROV", "W15", "W14", "W13", "W12", "W11", "W10", "W7", "W6", "W5", "W4", "W3", "F_FIN", "W1", "ImpTot", "F_MULTIB", "W9", "W8"], dataset_id="Transactions", num_partitions=1, project_id="pasquinelli-bigdata", table_id="t_11_Hotel_25_w_train", test_end_point="", timestamp_millis=1505224768418, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
任何建议都非常有用,因为我对GC比较新。 谢谢大家。
答案 0 :(得分:0)
支持从Cloud ML Engine读取BigQuery数据仍在开发中,因此您目前所做的工作不受支持。你遇到的问题是ML Engine运行的机器没有与BigQuery交流的合适范围。您可能在本地运行时遇到的潜在问题是从BigQuery读取性能不佳。这是需要解决的两个工作范例。
与此同时,我建议将数据导出到GCS进行培训。这将更具可扩展性,因此您不必担心随着数据的增加而导致培训效果不佳。这可以是一个很好的模式,它可以让你对数据进行一次预处理,以CSV格式将结果写入GCS,然后进行多次训练以尝试不同的算法或超参数。