I copied the spark-sklearn GridSearch example from http://go.databricks.com/hubfs/notebooks/Samples/Miscellaneous/blog_post_cv.html and used the spark-sklearn Python package to run the grid search distributed over Spark. My Python version is 2.7 and Spark is 2.1.0.
When I run it in Spark local mode with
spark-submit --master local XGBoostGridSearch.py
the code runs successfully.
But when I run the same code on YARN with
spark-submit --master yarn --py-files spark_sklearn.zip TestGridSearch.py
it fails with the following error message:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
return pickle.loads(obj)
ImportError: No module named sklearn.ensemble.forest
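(Since the unpickling in that traceback happens inside the executors' Python workers rather than on the driver, a throwaway job like the sketch below can show which interpreter each executor actually runs and whether it can import the module the error complains about. The app name and partition count are arbitrary; nothing in it is specific to my cluster.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EnvProbe").getOrCreate()

def probe(_):
    # runs inside an executor's Python worker, not on the driver
    import sys
    try:
        import sklearn.ensemble.forest  # the module the error complains about
        status = "ok (sklearn %s)" % sklearn.__version__
    except ImportError as e:
        status = str(e)
    yield (sys.executable, status)

print(spark.sparkContext.parallelize(range(8), 8).mapPartitions(probe).collect())

In local mode this just reports the driver's own interpreter, so it is only informative when submitted with --master yarn.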
The sklearn.ensemble.forest module and all the other Python dependencies are already installed on every executor through Anaconda2, yet I still get the error above. Here is my code:
from sklearn import grid_search, datasets
from sklearn.ensemble import RandomForestClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80]}
#gs = grid_search.GridSearchCV(RandomForestClassifier(), param_grid=param_grid)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SciKitLearn-GridSearch")\
    .config("spark.network.timeout", "10000s")\
    .config("spark.rpc.askTimeout", "10000s").getOrCreate()
spark.sparkContext.addPyFile("spark_sklearn.zip")

# spark_sklearn's GridSearchCV distributes the parameter-grid fits across the cluster
from spark_sklearn import GridSearchCV
gs = GridSearchCV(spark.sparkContext, RandomForestClassifier(), param_grid)
gs.fit(X, y)
Note: TestGridSearch.py is the file that contains the code above.
I'm confused as to why it runs in Spark local mode but fails in YARN (distributed) mode. Any help is much appreciated.
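For reference, my understanding is that in local mode the workers use the same interpreter that launched spark-submit, while on YARN each executor spawns whatever python is on its PATH unless PYSPARK_PYTHON points elsewhere. If that is the problem here, pinning the executors to the Anaconda2 interpreter would look roughly like this (the /opt/anaconda2/bin/python path is only a placeholder for wherever Anaconda2 actually lives on the nodes):

spark-submit --master yarn \
  --conf spark.executorEnv.PYSPARK_PYTHON=/opt/anaconda2/bin/python \
  --py-files spark_sklearn.zip TestGridSearch.py

As far as I know, Spark 2.1 also understands the spark.pyspark.python property for the same purpose. Is pointing the executors at the Anaconda interpreter like this the right direction, or is something else going on?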