Question

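I wrote some SparkR code and would like to know whether I can submit it with spark-submit or sparkR on an EMR cluster.

I have tried several approaches, for example sparkR mySparkRScript.r or sparkR --no-save mySparkScript.r, but every time I get the following error: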

Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  : 
JVM is not ready after 10 seconds

Sample code:

#Set the path for the R libraries you would like to use.
#You may need to modify this if you have custom R libraries.
.libPaths(c(.libPaths(), '/usr/lib/spark/R/lib'))

#Set the SPARK_HOME environment variable to the location on EMR
Sys.setenv(SPARK_HOME = '/usr/lib/spark')

#Load the SparkR library into R
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

#Initiate a Spark context and identify where the master node is located.
#local is used here because the RStudio server
#was installed on the master node

sc <- sparkR.session(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g"))

sqlContext <- sparkRSQL.init(sc)

Note: I can run my code in the sparkr-shell, either by pasting it in directly or by using source("mySparkRScript.R").


Answer 1

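I was able to get this running via Rscript. There are a few things you need to do, and the process is a bit involved. If you are willing to give it a go, I would recommend: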

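1. Figure out how to do an automated SparkR or sparklyR build, via https://github.com/UrbanInstitute/spark-social-science
2. Use the AWS CLI to first create a cluster with the EMR template and bootstrap script you will create by following step 1. (Make sure to put the EMR template and the rstudio_sparkr_emrlyr_blah_blah.sh scripts into an S3 bucket.)
3. Put your R code into a file and put that file into another S3 bucket. The sample code you provided works fine, but I would recommend actually performing some operation, for example reading data from S3, adding a value to it, and then writing it back out (just to confirm that it works before moving on to the heavier code you may have sitting around).
4. Create another .sh file that copies the R file from your S3 bucket to the cluster and then executes it via Rscript. Put this shell script in the same S3 bucket as your R code file (for simplicity). An example of the contents of this shell file:
```
#!/bin/bash
aws s3 cp s3://path/to/the/R/file/from/step3.R theNameOfTheFileToRun.R
Rscript theNameOfTheFileToRun.R
```
5. In the AWS CLI, at cluster creation time, add a --steps argument to the cluster creation call, using the CUSTOM JAR RUNNER provided by Amazon to run the shell script that copies and executes the R code.
6. Make sure to stop the Spark session at the end of your R code.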
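
The copy-and-run shell script in the list above can be made a little stricter, so that it stops at the first failing command and the EMR step fails on the S3 copy itself rather than on a later Rscript error. The following is only a sketch under assumptions, not part of the original setup: the function name run_r_from_s3 and the positional arguments are hypothetical, and it assumes that extra entries placed after the script path in the step's Args list reach the script as positional arguments, which you should verify for your EMR release.

```shell
#!/bin/bash
# Hypothetical stricter variant of the copy-and-run script.
# -e: exit on the first failing command; -u: treat unset variables as errors.
set -eu

# Copy an R script from S3 into the working directory and run it with Rscript.
# $1: S3 URI of the R file (placeholder path, as in the original script)
# $2: local file name to save it under
run_r_from_s3() {
  local s3_uri="$1"
  local local_name="$2"
  aws s3 cp "$s3_uri" "$local_name"
  Rscript "$local_name"
}

# Run only when an S3 URI is supplied; otherwise explain what is missing.
if [ "$#" -ge 1 ]; then
  run_r_from_s3 "$1" "${2:-theNameOfTheFileToRun.R}"
else
  echo "usage: $0 <s3-uri-of-R-file> [local-file-name]" >&2
fi
```

With this variant the S3 URI of the R file has to be passed as the first argument; if it is left out, the script only prints the usage line and runs nothing.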

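An example of the AWS CLI command might look like the following (in my example I use the us-east-1 region on Amazon and put a 100 GB disk on each worker in the cluster; just replace 'us-east-1' with your own region and pick whatever disk size you want):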

aws emr create-cluster --name "MY COOL SPARKR OR SPARKLYR CLUSTER WITH AN RSCRIPT TO RUN SOME R CODE" --release-label emr-5.8.0 --applications Name=Spark Name=Ganglia Name=Hadoop --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge 'InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.2xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=1}]}' --log-uri s3://path/to/EMR/sparkr_logs --bootstrap-action Path=s3://path/to/EMR/sparkr_bootstrap/rstudio_sparkr_emr5lyr-proc.sh,Args=['--user','cool_dude','--user-pw','top_secret','--shiny','true','--sparkr','true','sparklyr','true'] --ec2-attributes KeyName=mykeyfilename,InstanceProfile=EMR_EC2_DefaultRole,AdditionalMasterSecurityGroups="sg-abc123",SubnetId="subnet-abc123" --service-role EMR_DefaultRole --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --auto-terminate --region us-east-1 --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://path/to/the/shell/file/from/step4.sh"]
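
Because the command above uses ActionOnFailure=CONTINUE together with --auto-terminate, the cluster shuts down whether or not the R script succeeded, so it is worth checking the step state afterwards. A minimal sketch, assuming you kept the cluster id that create-cluster printed; the function name show_step_states is hypothetical and not part of the original answer:

```shell
#!/bin/bash
set -eu

# Print the name and state of every step on a cluster, one step per line.
# $1: cluster id as printed by create-cluster
# $2: region, defaulting to the us-east-1 used in the example command
show_step_states() {
  local cluster_id="$1"
  local region="${2:-us-east-1}"
  aws emr list-steps \
    --cluster-id "$cluster_id" \
    --region "$region" \
    --query 'Steps[].[Name,Status.State]' \
    --output text
}

# Run only when a cluster id is supplied on the command line.
if [ "$#" -ge 1 ]; then
  show_step_states "$@"
fi
```

A state of COMPLETED for the CustomJAR step suggests the shell script, and therefore Rscript, exited cleanly; for a FAILED step the logs under the --log-uri bucket are the place to look.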
