GridSearch using spark-sklearn throws multiprocessing.pool.MaybeEncodingError

Date: 2017-12-01 05:19:48

Tags: python apache-spark scikit-learn pyspark

I am trying to parallelize a GridSearchCV run using the spark-sklearn module. I am using an EMR cluster with the following configuration:

Master node - c4.4xlarge (30 GB, 8 cores / 16 vCPUs)
Worker nodes (3 of them) - c4.8xlarge (60 GB, 18 cores / 36 vCPUs)

Edit: I cannot paste the full code, but it is quite simple:

from sklearn.model_selection import train_test_split
from spark_sklearn import GridSearchCV as sp_GridSearchCV

# sc is the SparkContext; clf, parameters, cv and verbose are defined elsewhere
grid_search = sp_GridSearchCV(sc, clf, parameters, n_jobs=-1, cv=cv, scoring='f1_macro', verbose=verbose)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y)
grid_search.fit(X_train, y_train)
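
Roughly, clf and parameters look like the sketch below (CustomFeatures is my own transformer and is only stubbed here; the parameter grid and the FeatureUnion layout are simplified and illustrative, not my exact code):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import LogisticRegression

class CustomFeatures(BaseEstimator, TransformerMixin):
    # Stub for my custom transformer; the real one derives extra text features.
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X

clf = Pipeline([
    ('union', FeatureUnion([
        ('text', Pipeline([
            ('features', CustomFeatures()),
            ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
            ('kbest', SelectPercentile(score_func=chi2, percentile=100)),
        ])),
    ], n_jobs=-1)),  # joblib parallelism here spawns the multiprocessing pool seen in the traceback
    ('logreg', LogisticRegression(solver='liblinear', n_jobs=-1)),
])

parameters = {'logreg__C': [0.1, 1.0, 10.0]}  # illustrative grid only
cv = 5
verbose = 1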

More information: [https://github.com/databricks/spark-sklearn]

I am running the following spark-submit command:

 spark-submit --executor-cores 7 --num-executors 4 --executor-memory 9G --master yarn --deploy-mode client <program_name.py>

Data size: 6000 rows of text data; that is all, so it is not particularly large.

But I keep running into the error below, which originates from standard sklearn's gridsearch.py.

Has anyone run into this error before, or does anyone know what I am doing wrong?

SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/mnt/yarn/usercache/hadoop/filecache/11/__spark_libs__7590692825730242151.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    17/11/30 22:53:46 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 18282@ip-10-17-36-80
    17/11/30 22:53:46 INFO SignalUtils: Registered signal handler for TERM
    17/11/30 22:53:46 INFO SignalUtils: Registered signal handler for HUP
    17/11/30 22:53:46 INFO SignalUtils: Registered signal handler for INT
    17/11/30 22:53:47 INFO SecurityManager: Changing view acls to: yarn,hadoop
    17/11/30 22:53:47 INFO SecurityManager: Changing modify acls to: yarn,hadoop
    17/11/30 22:53:47 INFO SecurityManager: Changing view acls groups to: 
    17/11/30 22:53:47 INFO SecurityManager: Changing modify acls groups to: 
    17/11/30 22:53:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users  with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
    17/11/30 22:53:47 INFO TransportClientFactory: Successfully created connection to /10.17.36.61:36396 after 53 ms (0 ms spent in bootstraps)
    17/11/30 22:53:47 INFO SecurityManager: Changing view acls to: yarn,hadoop
    17/11/30 22:53:47 INFO SecurityManager: Changing modify acls to: yarn,hadoop
    17/11/30 22:53:47 INFO SecurityManager: Changing view acls groups to: 
    17/11/30 22:53:47 INFO SecurityManager: Changing modify acls groups to: 
    17/11/30 22:53:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users  with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
    17/11/30 22:53:47 INFO TransportClientFactory: Successfully created connection to /10.17.36.61:36396 after 0 ms (0 ms spent in bootstraps)
    17/11/30 22:53:47 INFO DiskBlockManager: Created local directory at /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/blockmgr-9390d584-49ab-407b-ab0d-c98f5cae1bfe
    17/11/30 22:53:47 INFO MemoryStore: MemoryStore started with capacity 7.5 GB
    17/11/30 22:53:47 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.17.36.61:36396
    17/11/30 22:53:47 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
    17/11/30 22:53:47 INFO Executor: Starting executor ID 1 on host ip-10-17-36-80.us-west-2.compute.internal
    17/11/30 22:53:48 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44955.
    17/11/30 22:53:48 INFO NettyBlockTransferService: Server created on ip-10-17-36-80.us-west-2.compute.internal:44955
    17/11/30 22:53:48 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
    17/11/30 22:53:48 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(1, ip-10-17-36-80.us-west-2.compute.internal, 44955, None)
    17/11/30 22:53:48 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, ip-10-17-36-80.us-west-2.compute.internal, 44955, None)
    17/11/30 22:53:48 INFO BlockManager: external shuffle service port = 7337
    17/11/30 22:53:48 INFO BlockManager: Registering executor with local external shuffle service.
    17/11/30 22:53:48 INFO TransportClientFactory: Successfully created connection to ip-10-17-36-80.us-west-2.compute.internal/10.17.36.80:7337 after 0 ms (0 ms spent in bootstraps)
    17/11/30 22:53:48 INFO BlockManager: Initialized BlockManager: BlockManagerId(1, ip-10-17-36-80.us-west-2.compute.internal, 44955, None)
    17/11/30 22:53:50 INFO CoarseGrainedExecutorBackend: Got assigned task 5
    17/11/30 22:53:50 INFO CoarseGrainedExecutorBackend: Got assigned task 14
    17/11/30 22:53:50 INFO CoarseGrainedExecutorBackend: Got assigned task 23
    17/11/30 22:53:50 INFO CoarseGrainedExecutorBackend: Got assigned task 32
    17/11/30 22:53:50 INFO CoarseGrainedExecutorBackend: Got assigned task 41
    17/11/30 22:53:50 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
    17/11/30 22:53:50 INFO Executor: Running task 23.0 in stage 0.0 (TID 23)
    17/11/30 22:53:50 INFO Executor: Running task 14.0 in stage 0.0 (TID 14)
    17/11/30 22:53:50 INFO Executor: Running task 32.0 in stage 0.0 (TID 32)
    17/11/30 22:53:50 INFO Executor: Running task 41.0 in stage 0.0 (TID 41)
    17/11/30 22:53:50 INFO Executor: Fetching spark://10.17.36.61:36396/files/main.py with timestamp 1512082429416
    17/11/30 22:53:50 INFO TransportClientFactory: Successfully created connection to /10.17.36.61:36396 after 1 ms (0 ms spent in bootstraps)
    17/11/30 22:53:50 INFO Utils: Fetching spark://10.17.36.61:36396/files/main.py to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/fetchFileTemp5747466536720143705.tmp
    17/11/30 22:53:50 INFO Utils: Copying /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/-7964204541512082429416_cache to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/./main.py
    17/11/30 22:53:50 INFO Executor: Fetching spark://10.17.36.61:36396/files/config.py with timestamp 1512082429407
    17/11/30 22:53:50 INFO Utils: Fetching spark://10.17.36.61:36396/files/config.py to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/fetchFileTemp1456561651760694351.tmp
    17/11/30 22:53:50 INFO Utils: Copying /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/-13639922071512082429407_cache to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/./config.py
    17/11/30 22:53:50 INFO Executor: Fetching spark://10.17.36.61:36396/files/helper.py with timestamp 1512082429375
    17/11/30 22:53:50 INFO Utils: Fetching spark://10.17.36.61:36396/files/helper.py to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/fetchFileTemp3268157159972473452.tmp
    17/11/30 22:53:50 INFO Utils: Copying /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/7433209651512082429375_cache to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/./helper.py
    17/11/30 22:53:50 INFO Executor: Fetching spark://10.17.36.61:36396/files/transformers.py with timestamp 1512082429412
    17/11/30 22:53:50 INFO Utils: Fetching spark://10.17.36.61:36396/files/transformers.py to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/fetchFileTemp5944429394790398969.tmp
    17/11/30 22:53:50 INFO Utils: Copying /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/-3422273991512082429412_cache to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/./transformers.py
    17/11/30 22:53:50 INFO TorrentBroadcast: Started reading broadcast variable 2
    17/11/30 22:53:50 INFO TransportClientFactory: Successfully created connection to /10.17.36.61:42897 after 1 ms (0 ms spent in bootstraps)
    17/11/30 22:53:50 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 9.9 KB, free 7.5 GB)
    17/11/30 22:53:50 INFO TorrentBroadcast: Reading broadcast variable 2 took 99 ms
    17/11/30 22:53:50 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 14.3 KB, free 7.5 GB)
    17/11/30 22:53:51 INFO TorrentBroadcast: Started reading broadcast variable 0
    17/11/30 22:53:51 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 226.1 KB, free 7.5 GB)
    17/11/30 22:53:51 INFO TorrentBroadcast: Reading broadcast variable 0 took 9 ms
    17/11/30 22:53:51 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 368.0 B, free 7.5 GB)
    17/11/30 22:53:51 INFO TorrentBroadcast: Started reading broadcast variable 1
    17/11/30 22:53:51 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 6.7 KB, free 7.5 GB)
    17/11/30 22:53:51 INFO TorrentBroadcast: Reading broadcast variable 1 took 6 ms
    17/11/30 22:53:51 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 368.0 B, free 7.5 GB)
    /usr/local/lib64/python3.5/site-packages/sklearn/linear_model/logistic.py:1228: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = -1.
      " = {}.".format(self.n_jobs))
    17/11/30 23:35:37 ERROR Executor: Exception in task 32.0 in stage 0.0 (TID 32)
    org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/pyspark.zip/pyspark/worker.py", line 177, in main
        process()
      File "/mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/pyspark.zip/pyspark/worker.py", line 172, in process
        serializer.dump_stream(func(split_index, iterator), outfile)
      File "/mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
        vs = list(itertools.islice(iterator, batch))
      File "/usr/local/lib/python3.5/site-packages/spark_sklearn/grid_search.py", line 319, in fun
        return_parameters=True, error_score=error_score)
      File "/usr/local/lib64/python3.5/site-packages/sklearn/model_selection/_validation.py", line 437, in _fit_and_score
        estimator.fit(X_train, y_train, **fit_params)
      File "/usr/local/lib64/python3.5/site-packages/sklearn/pipeline.py", line 257, in fit
        Xt, fit_params = self._fit(X, y, **fit_params)
      File "/usr/local/lib64/python3.5/site-packages/sklearn/pipeline.py", line 222, in _fit
        **fit_params_steps[name])
      File "/usr/local/lib64/python3.5/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
        return self.func(*args, **kwargs)
      File "/usr/local/lib64/python3.5/site-packages/sklearn/pipeline.py", line 589, in _fit_transform_one
        res = transformer.fit_transform(X, y, **fit_params)
      File "/usr/local/lib64/python3.5/site-packages/sklearn/pipeline.py", line 746, in fit_transform
        for name, trans, weight in self._iter())
      File "/usr/local/lib64/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 789, in __call__
        self.retrieve()
      File "/usr/local/lib64/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 699, in retrieve
        self._output.extend(job.get(timeout=self.timeout))
      File "/usr/lib64/python3.5/multiprocessing/pool.py", line 608, in get
        raise self._value
    multiprocessing.pool.MaybeEncodingError: Error sending result: '[(array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           ..., 
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.]]), Pipeline(memory=None,
         steps=[('features', CustomFeatures()), ('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
            dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
            lowercase=True, max_df=1.0, max_features=None, min_df=1,
            ngram_range=(1, 2), norm='l2', prepr...ac3c18>), ('kbest', SelectPercentile(percentile=100, score_func=<function chi2 at 0x7f38fbf23f28>))]))]'. Reason: 'MemoryError()'

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

1 Answer:

Answer 0 (score: 0)

OK, the solution to the above problem was to reduce the dimensionality of the dataset. I used max_df and min_df in the TF-IDF step to achieve this. It works fine now.
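
A rough sketch of the kind of change this refers to (the thresholds below are illustrative, not my exact values): constraining the TF-IDF vocabulary keeps the transformed matrices small enough to be serialized back by the worker processes.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=5,            # ignore terms appearing in fewer than 5 documents
    max_df=0.8,          # ignore terms appearing in more than 80% of the documents
    max_features=50000,  # optional hard cap on vocabulary size (not mentioned above, another way to limit dimensionality)
)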