I am running a Python program with Keras on a Spark cluster (performing a grid search over a deep neural network), and I want to make sure the results are reproducible.

I fire up my Spark cluster by running the following .sh script:
SPARK_HOME=/u/users/******/spark-2.3.0 \
Q_CORE_LOC=/u/users/******/q-core \
ENV=local \
HIVE_HOME=/usr/hdp/current/hive-client \
SPARK2_HOME=/u/users/******/spark-2.3.0 \
HADOOP_CONF_DIR=/etc/hadoop/conf \
HIVE_CONF_DIR=/etc/hive/conf \
HDFS_PREFIX=hdfs:// \
PYTHONPATH=/u/users/******/q-core/python-lib:/u/users/******/three-queues/python-lib:/u/users/******/pyenv/prod_python_libs/lib/python2.7/site-packages/:$PYTHON_PATH \
YARN_HOME=/usr/hdp/current/hadoop-yarn-client \
SPARK_DIST_CLASSPATH=$(hadoop classpath):$(yarn classpath):/etc/hive/conf/hive-site.xml \
PYSPARK_PYTHON=/usr/bin/python2.7 \
QQQ_LOC=/u/users/******/three-queues \
spark-submit \
--master yarn 'dnn_grid_search.py' \
--executor-memory 10g \
--num-executors 8 \
--executor-cores 10 \
--conf spark.port.maxRetries=80 \
--conf spark.dynamicAllocation.enabled=False \
--conf spark.default.parallelism=6000 \
--conf spark.sql.shuffle.partitions=6000 \
--principal ************************ \
--queue default \
--name lets_get_starting \
--keytab /u/users/******/.******.keytab \
--driver-memory 10g
By running nohup ./spark_python_shell.sh > output.log & at the bash shell, I fire up the Spark cluster, and my python-Keras script runs as well (see the spark-submit \ --master yarn 'dnn_grid_search.py' line above).
To ensure the reproducibility of the results, I tried to do what I had successfully done on the CPU of my laptop to get reproducible results (see also my answer on StackOverflow: documentation, answer_1):
# Seed value
# Apparently you may use different seed values at each stage
seed_value= 0
# 1. Set `PYTHONHASHSEED` environment variable at a fixed value
import os
os.environ['PYTHONHASHSEED']=str(seed_value)
# 2. Set `python` built-in pseudo-random generator at a fixed value
import random
random.seed(seed_value)
# 3. Set `numpy` pseudo-random generator at a fixed value
import numpy as np
np.random.seed(seed_value)
# 4. Set `tensorflow` pseudo-random generator at a fixed value
import tensorflow as tf
tf.set_random_seed(seed_value)
# 5. Configure a new global `tensorflow` session
from keras import backend as K
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)
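One detail I am aware of: with GridSearchCV(n_jobs=-1) the scoring may run in worker processes that never executed the snippet above. For reference, the same steps factored into a helper that such a process could call itself would look roughly like this (the name set_global_seeds and the factoring are just illustrative; the calls mirror steps 1-5, and note that PYTHONHASHSEED normally has to be set before the interpreter starts to actually affect string hashing):

def set_global_seeds(seed_value=0):
    # Steps 1-5 from above, bundled so that the process that actually
    # trains the model (e.g. a joblib worker) can apply them itself.
    import os
    import random
    import numpy as np
    import tensorflow as tf
    from keras import backend as K

    # Caveat: setting PYTHONHASHSEED here only affects child processes;
    # the current interpreter's string hashing is fixed at startup.
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    random.seed(seed_value)
    np.random.seed(seed_value)
    tf.set_random_seed(seed_value)

    # Single-threaded session, as in step 5, to avoid nondeterminism
    # coming from thread scheduling.
    session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                                  inter_op_parallelism_threads=1)
    sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
    K.set_session(sess)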
However, when I include part (5), I am fairly sure that the program is not running properly, because some print statements included in it print nothing to the output.log file.
Specifically, I created a very simple custom scorer function and inserted a print statement into it, so that I could see whether the grid search is actually running:
def precision_recall(y_true, y_predict):
    # Print time
    from datetime import datetime
    time_now = datetime.utcnow()
    print('Grid Search time now:', time_now)

    from sklearn.metrics import precision_recall_fscore_support
    _, recall, _, _ = precision_recall_fscore_support(y_true, y_predict, average='binary')
    return recall
from sklearn.metrics import make_scorer
custom_scorer = make_scorer(score_func=precision_recall, greater_is_better=True, needs_proba=False)
# Execute grid search
# Notice 'n_jobs=-1' for parallelizing the jobs (across the cluster)
from sklearn.model_selection import GridSearchCV
classifiers_grid = GridSearchCV(estimator=classifier, param_grid=parameters, scoring=custom_scorer, cv=5, n_jobs=-1)
classifiers_grid.fit(X, y)
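In case output buffering is what hides the message, a variant of the scorer that flushes stdout explicitly would look like this (sys.stdout.flush() is plain Python, nothing Keras- or Spark-specific; the name precision_recall_flushed is just to distinguish it from the scorer above):

from __future__ import print_function  # consistent print() under Python 2.7, as used above
import sys
from datetime import datetime
from sklearn.metrics import precision_recall_fscore_support

def precision_recall_flushed(y_true, y_predict):
    # Same scorer as above, but flush stdout immediately so the message
    # cannot sit in a buffer when output is redirected (e.g. via nohup).
    print('Grid Search time now:', datetime.utcnow())
    sys.stdout.flush()

    _, recall, _, _ = precision_recall_fscore_support(y_true, y_predict, average='binary')
    return recall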
However, while this print statement in the custom scorer function (precision_recall) correctly prints the time when I run the grid search with other ML algorithms (random forest etc.), when I try it with Keras/Tensorflow together with its seeding and session setup, nothing is printed.
So my question is: how can I get reproducible results with Keras/Tensorflow on Spark?