Can sparklyr be used with Spark deployed on a YARN-managed Hadoop cluster?

Date: 2016-06-29 14:42:20

Tags: r apache-spark yarn sparkapi sparklyr

Is the sparklyr R package able to connect to a YARN-managed Hadoop cluster? This does not seem to be documented in the cluster deployment documentation. Using the SparkR package that ships with Spark, it is possible to do the following:

# set R environment variables
Sys.setenv(YARN_CONF_DIR=...)
Sys.setenv(SPARK_CONF_DIR=...)
Sys.setenv(LD_LIBRARY_PATH=...)
Sys.setenv(SPARKR_SUBMIT_ARGS=...)

sparkr_lib_dir <- ... # install specific
library(SparkR, lib.loc = c(sparkr_lib_dir, .libPaths()))
sc <- sparkR.init(master = "yarn-client")

However, when I replace the last line above with
library(sparklyr)
sc <- spark_connect(master = "yarn-client")

I receive the error:

Error in start_shell(scon, list(), jars, packages) : 
  Failed to launch Spark shell. Ports file does not exist.
    Path: /usr/hdp/2.4.2.0-258/spark/bin/spark-submit
    Parameters: '--packages' 'com.databricks:spark-csv_2.11:1.3.0,com.amazonaws:aws-java-sdk-pom:1.10.34' '--jars' '<path to R lib>/3.2/sparklyr/java/rspark_utils.jar'  sparkr-shell /tmp/RtmpT31OQT/filecfb07d7f8bfd.out

Ivy Default Cache set to: /home/mpollock/.ivy2/cache
The jars for the packages stored in: /home/mpollock/.ivy2/jars
:: loading settings :: url = jar:file:<path to spark install>/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
:: resolution report :: resolve 480ms :: artifacts dl 0ms
    :: modules in use:
    -----------------------------------------

Is sparklyr an alternative to SparkR, or is it built on top of the SparkR package?

4 answers:

Answer 0 (score: 5)

Yes, sparklyr can be used against a YARN-managed cluster. In order to connect to a YARN-managed cluster you need to:

  1. Set the SPARK_HOME environment variable to point to the correct Spark home directory.
  2. Connect to the Spark cluster using the appropriate master location, for instance: sc <- spark_connect(master = "yarn-client")
  3. See also: http://spark.rstudio.com/deployment.html
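
Put together, the two steps above can be sketched as follows. This is only a sketch: the SPARK_HOME path shown is an assumption that varies by installation, and it requires a live YARN cluster to actually connect.

```r
# Point sparklyr at the cluster's Spark installation
# (example path; substitute your installation's actual Spark home)
Sys.setenv(SPARK_HOME = "/usr/hdp/current/spark-client")

library(sparklyr)

# Connect through YARN in client mode
sc <- spark_connect(master = "yarn-client")
```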

Answer 1 (score: 2)

Yes it can, but there is one catch on top of everything else that has been written, one that is very elusive in the blogging literature and that centers on configuring resources.

The key is this: when you execute in local mode you do not have to declare resources in the configuration, but when you execute against a YARN cluster, you absolutely must declare them. It took me a long time to find an article that shed light on this issue, but once I tried it, it worked.

Here is an (arbitrary) example with the key resource declarations:

config <- spark_config()
config$spark.driver.cores <- 32
config$spark.executor.cores <- 32
config$spark.executor.memory <- "40g"

And a fuller end-to-end example:

library(sparklyr)

Sys.setenv(SPARK_HOME = "/usr/local/spark")
Sys.setenv(HADOOP_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')
Sys.setenv(YARN_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')

config <- spark_config()
config$spark.executor.instances <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"

sc <- spark_connect(master="yarn-client", config=config, version = '2.1.0')

R Bloggers Link to Article

Answer 2 (score: 0)

Are you possibly using Cloudera Hadoop (CDH)?

I am asking because I encountered the same issue when using the CDH-provided Spark distribution:

Sys.getenv('SPARK_HOME')
[1] "/usr/lib/spark"  # CDH-provided Spark
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
Error in sparkapi::start_shell(master = master, spark_home = spark_home,  : 
      Failed to launch Spark shell. Ports file does not exist.
        Path: /usr/lib/spark/bin/spark-submit
        Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', --packages, 'com.databricks:spark-csv_2.11:1.3.0','com.amazonaws:aws-java-sdk-pom:1.10.34', sparkr-shell, /tmp/Rtmp6RwEnV/file307975dc1ea0.out

Ivy Default Cache set to: /home/oracle/.ivy2/cache
The jars for the packages stored in: /home/oracle/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found com.databricks#spark-csv_2.11;1.3.0 in central
    found org.apache.commons#commons-csv;1.1 in central
    found com.univocity#univocity-parsers;1.5.1 in central
    found com.

However, after I downloaded a pre-built version from Databricks (Spark 1.6.1, Hadoop 2.6) and pointed SPARK_HOME at it, I was able to connect successfully:

Sys.setenv(SPARK_HOME = '/home/oracle/spark-1.6.1-bin-hadoop2.6') 
sc <- spark_connect(master = "yarn-client") # OK
library(dplyr)
iris_tbl <- copy_to(sc, iris)
src_tbls(sc)
[1] "iris"

Cloudera does not yet include SparkR in its distribution, and I suspect that sparklyr may still have a subtle dependency on SparkR. Here are the results when trying to use the CDH-provided Spark, but with the config=list() argument suggested in the sparklyr issue thread on Github:

sc <- spark_connect(master='yarn-client', config=list()) # with CDH-provided Spark
Error in sparkapi::start_shell(master = master, spark_home = spark_home,  : 
      Failed to launch Spark shell. Ports file does not exist.
        Path: /usr/lib/spark/bin/spark-submit
        Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', sparkr-shell, /tmp/Rtmpi9KWFt/file22276cf51d90.out
Error: sparkr.zip does not exist for R application in YARN mode.

Also, if you check the rightmost part of the Parameters section of the error (both yours and mine), you will see a reference to sparkr-shell...

(Tested with sparkapi 0.2.28 and sparklyr 0.3.15, from an R session in RStudio Server on Oracle Linux)

Answer 3 (score: 0)

For this issue, it is recommended to upgrade to sparklyr version 0.2.30 or newer. Upgrade using devtools::install_github("rstudio/sparklyr"), then restart your R session.
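
A minimal sketch of that upgrade path, assuming the devtools package is available (install it first if it is not):

```r
# install.packages("devtools")   # uncomment if devtools is not yet installed

# Install the development version of sparklyr from GitHub
devtools::install_github("rstudio/sparklyr")

# Restart the R session afterwards so the new version is loaded
```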