I am trying to use the spark-sklearn module to parallelize a GridSearchCV operation. I am running on an EMR cluster with the following configuration:
Master node: c4.4xlarge (30 GB, 8 cores / 16 vCPUs)
Worker nodes (3): c4.8xlarge (60 GB, 18 cores / 36 vCPUs)
Edit: I can't paste the full code, but it is straightforward:
from spark_sklearn import GridSearchCV as sp_GridSearchCV
from sklearn.model_selection import train_test_split

# sc is the existing SparkContext; clf, parameters, cv, X, y are defined earlier
grid_search = sp_GridSearchCV(sc, clf, parameters, n_jobs=-1, cv=cv, scoring='f1_macro', verbose=verbose)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y)
grid_search.fit(X_train, y_train)
More information: https://github.com/databricks/spark-sklearn
I am running the following spark-submit command:
spark-submit --executor-cores 7 --num-executors 4 --executor-memory 9G --master yarn --deploy-mode client <program_name.py>
Data size: 6,000 rows of text data, i.e. not large at all.
But I keep running into the error below, which originates from standard sklearn's grid search code.
Has anyone run into this error before, or can anyone see what I am doing wrong?
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/yarn/usercache/hadoop/filecache/11/__spark_libs__7590692825730242151.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/11/30 22:53:46 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 18282@ip-10-17-36-80
17/11/30 22:53:46 INFO SignalUtils: Registered signal handler for TERM
17/11/30 22:53:46 INFO SignalUtils: Registered signal handler for HUP
17/11/30 22:53:46 INFO SignalUtils: Registered signal handler for INT
17/11/30 22:53:47 INFO SecurityManager: Changing view acls to: yarn,hadoop
17/11/30 22:53:47 INFO SecurityManager: Changing modify acls to: yarn,hadoop
17/11/30 22:53:47 INFO SecurityManager: Changing view acls groups to:
17/11/30 22:53:47 INFO SecurityManager: Changing modify acls groups to:
17/11/30 22:53:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
17/11/30 22:53:47 INFO TransportClientFactory: Successfully created connection to /10.17.36.61:36396 after 53 ms (0 ms spent in bootstraps)
17/11/30 22:53:47 INFO SecurityManager: Changing view acls to: yarn,hadoop
17/11/30 22:53:47 INFO SecurityManager: Changing modify acls to: yarn,hadoop
17/11/30 22:53:47 INFO SecurityManager: Changing view acls groups to:
17/11/30 22:53:47 INFO SecurityManager: Changing modify acls groups to:
17/11/30 22:53:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
17/11/30 22:53:47 INFO TransportClientFactory: Successfully created connection to /10.17.36.61:36396 after 0 ms (0 ms spent in bootstraps)
17/11/30 22:53:47 INFO DiskBlockManager: Created local directory at /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/blockmgr-9390d584-49ab-407b-ab0d-c98f5cae1bfe
17/11/30 22:53:47 INFO MemoryStore: MemoryStore started with capacity 7.5 GB
17/11/30 22:53:47 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.17.36.61:36396
17/11/30 22:53:47 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
17/11/30 22:53:47 INFO Executor: Starting executor ID 1 on host ip-10-17-36-80.us-west-2.compute.internal
17/11/30 22:53:48 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44955.
17/11/30 22:53:48 INFO NettyBlockTransferService: Server created on ip-10-17-36-80.us-west-2.compute.internal:44955
17/11/30 22:53:48 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/11/30 22:53:48 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(1, ip-10-17-36-80.us-west-2.compute.internal, 44955, None)
17/11/30 22:53:48 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, ip-10-17-36-80.us-west-2.compute.internal, 44955, None)
17/11/30 22:53:48 INFO BlockManager: external shuffle service port = 7337
17/11/30 22:53:48 INFO BlockManager: Registering executor with local external shuffle service.
17/11/30 22:53:48 INFO TransportClientFactory: Successfully created connection to ip-10-17-36-80.us-west-2.compute.internal/10.17.36.80:7337 after 0 ms (0 ms spent in bootstraps)
17/11/30 22:53:48 INFO BlockManager: Initialized BlockManager: BlockManagerId(1, ip-10-17-36-80.us-west-2.compute.internal, 44955, None)
17/11/30 22:53:50 INFO CoarseGrainedExecutorBackend: Got assigned task 5
17/11/30 22:53:50 INFO CoarseGrainedExecutorBackend: Got assigned task 14
17/11/30 22:53:50 INFO CoarseGrainedExecutorBackend: Got assigned task 23
17/11/30 22:53:50 INFO CoarseGrainedExecutorBackend: Got assigned task 32
17/11/30 22:53:50 INFO CoarseGrainedExecutorBackend: Got assigned task 41
17/11/30 22:53:50 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
17/11/30 22:53:50 INFO Executor: Running task 23.0 in stage 0.0 (TID 23)
17/11/30 22:53:50 INFO Executor: Running task 14.0 in stage 0.0 (TID 14)
17/11/30 22:53:50 INFO Executor: Running task 32.0 in stage 0.0 (TID 32)
17/11/30 22:53:50 INFO Executor: Running task 41.0 in stage 0.0 (TID 41)
17/11/30 22:53:50 INFO Executor: Fetching spark://10.17.36.61:36396/files/main.py with timestamp 1512082429416
17/11/30 22:53:50 INFO TransportClientFactory: Successfully created connection to /10.17.36.61:36396 after 1 ms (0 ms spent in bootstraps)
17/11/30 22:53:50 INFO Utils: Fetching spark://10.17.36.61:36396/files/main.py to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/fetchFileTemp5747466536720143705.tmp
17/11/30 22:53:50 INFO Utils: Copying /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/-7964204541512082429416_cache to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/./main.py
17/11/30 22:53:50 INFO Executor: Fetching spark://10.17.36.61:36396/files/config.py with timestamp 1512082429407
17/11/30 22:53:50 INFO Utils: Fetching spark://10.17.36.61:36396/files/config.py to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/fetchFileTemp1456561651760694351.tmp
17/11/30 22:53:50 INFO Utils: Copying /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/-13639922071512082429407_cache to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/./config.py
17/11/30 22:53:50 INFO Executor: Fetching spark://10.17.36.61:36396/files/helper.py with timestamp 1512082429375
17/11/30 22:53:50 INFO Utils: Fetching spark://10.17.36.61:36396/files/helper.py to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/fetchFileTemp3268157159972473452.tmp
17/11/30 22:53:50 INFO Utils: Copying /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/7433209651512082429375_cache to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/./helper.py
17/11/30 22:53:50 INFO Executor: Fetching spark://10.17.36.61:36396/files/transformers.py with timestamp 1512082429412
17/11/30 22:53:50 INFO Utils: Fetching spark://10.17.36.61:36396/files/transformers.py to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/fetchFileTemp5944429394790398969.tmp
17/11/30 22:53:50 INFO Utils: Copying /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/spark-2016730f-a9a7-42f6-af43-e214e17a799c/-3422273991512082429412_cache to /mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/./transformers.py
17/11/30 22:53:50 INFO TorrentBroadcast: Started reading broadcast variable 2
17/11/30 22:53:50 INFO TransportClientFactory: Successfully created connection to /10.17.36.61:42897 after 1 ms (0 ms spent in bootstraps)
17/11/30 22:53:50 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 9.9 KB, free 7.5 GB)
17/11/30 22:53:50 INFO TorrentBroadcast: Reading broadcast variable 2 took 99 ms
17/11/30 22:53:50 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 14.3 KB, free 7.5 GB)
17/11/30 22:53:51 INFO TorrentBroadcast: Started reading broadcast variable 0
17/11/30 22:53:51 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 226.1 KB, free 7.5 GB)
17/11/30 22:53:51 INFO TorrentBroadcast: Reading broadcast variable 0 took 9 ms
17/11/30 22:53:51 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 368.0 B, free 7.5 GB)
17/11/30 22:53:51 INFO TorrentBroadcast: Started reading broadcast variable 1
17/11/30 22:53:51 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 6.7 KB, free 7.5 GB)
17/11/30 22:53:51 INFO TorrentBroadcast: Reading broadcast variable 1 took 6 ms
17/11/30 22:53:51 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 368.0 B, free 7.5 GB)
/usr/local/lib64/python3.5/site-packages/sklearn/linear_model/logistic.py:1228: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = -1.
" = {}.".format(self.n_jobs))
17/11/30 23:35:37 ERROR Executor: Exception in task 32.0 in stage 0.0 (TID 32)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1512081909079_0001/container_1512081909079_0001_01_000002/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/lib/python3.5/site-packages/spark_sklearn/grid_search.py", line 319, in fun
return_parameters=True, error_score=error_score)
File "/usr/local/lib64/python3.5/site-packages/sklearn/model_selection/_validation.py", line 437, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/usr/local/lib64/python3.5/site-packages/sklearn/pipeline.py", line 257, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "/usr/local/lib64/python3.5/site-packages/sklearn/pipeline.py", line 222, in _fit
**fit_params_steps[name])
File "/usr/local/lib64/python3.5/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
return self.func(*args, **kwargs)
File "/usr/local/lib64/python3.5/site-packages/sklearn/pipeline.py", line 589, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/usr/local/lib64/python3.5/site-packages/sklearn/pipeline.py", line 746, in fit_transform
for name, trans, weight in self._iter())
File "/usr/local/lib64/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 789, in __call__
self.retrieve()
File "/usr/local/lib64/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 699, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/usr/lib64/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[(array([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]]), Pipeline(memory=None,
steps=[('features', CustomFeatures()), ('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 2), norm='l2', prepr...ac3c18>), ('kbest', SelectPercentile(percentile=100, score_func=<function chi2 at 0x7f38fbf23f28>))]))]'. Reason: 'MemoryError()'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Answer 0 (score: 0):
OK, the solution to the above problem was to reduce the dimensionality of the dataset. I did this with max_df and min_df in the TF-IDF step. It works fine now.
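For reference, here is a minimal sketch of that change; the threshold values below are illustrative, not the answerer's actual settings:

from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative thresholds (assumptions, not from the original post):
# max_df=0.5 drops terms that appear in more than half of the documents,
# min_df=5 drops terms that appear in fewer than 5 documents.
# Both shrink the TF-IDF vocabulary, so the feature matrices that joblib
# passes between worker processes are much smaller and the MemoryError
# seen in the traceback is far less likely.
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_df=0.5, min_df=5)

This TfidfVectorizer would then replace the 'tfidf' step of the pipeline shown in the traceback above.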