SparkR on a cluster deployed with bdutil

Date: 2015-05-27 12:22:06

Tags: r apache-spark google-hadoop

I have been using bdutil with Hadoop and Spark for a year now, and it has worked perfectly! Now I am running into problems getting SparkR to work with Google Storage as HDFS.

Here is my setup:

- bdutil 1.2.1
- a deployed cluster with 1 master and 1 worker, with Spark 1.3.0 installed
- R and SparkR installed on both the master and the worker

When I run SparkR on the master node, I try to point it at a directory in my GS bucket in several ways:

1) By using the gs:// filesystem scheme

> file <- textFile(sc, "gs://xxxxx/dir/")
> count(file)
15/05/27 12:02:02 WARN LoadSnappy: Snappy native library is available
15/05/27 12:02:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/27 12:02:02 WARN LoadSnappy: Snappy native library not loaded
collect on 5 failed with java.lang.reflect.InvocationTargetException
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at edu.berkeley.cs.amplab.sparkr.SparkRBackendHandler.handleMethodCall(SparkRBackendHandler.scala:111)
        at edu.berkeley.cs.amplab.sparkr.SparkRBackendHandler.channelRead0(SparkRBackendHandler.scala:58)
        at edu.berkeley.cs.amplab.sparkr.SparkRBackendHandler.channelRead0(SparkRBackendHandler.scala:19)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: No FileSystem for scheme: gs
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1383)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at edu.berkeley.cs.amplab.sparkr.BaseRRDD.getPartitions(RRDD.scala:31)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1511)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
        at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:312)
        at org.apache.spark.api.java.JavaRDD.collect(JavaRDD.scala:32)
        ... 25 more
Error: returnStatus == 0 is not TRUE

2) By using the HDFS URL

> file <- textFile(sc, "hdfs://hadoop-stage-m:8020/dir/")
> count(file)
collect on 10 failed with java.lang.reflect.InvocationTargetException
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at edu.berkeley.cs.amplab.sparkr.SparkRBackendHandler.handleMethodCall(SparkRBackendHandler.scala:111)
        at edu.berkeley.cs.amplab.sparkr.SparkRBackendHandler.channelRead0(SparkRBackendHandler.scala:58)
        at edu.berkeley.cs.amplab.sparkr.SparkRBackendHandler.channelRead0(SparkRBackendHandler.scala:19)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://hadoop-stage-m:8020/dir
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at edu.berkeley.cs.amplab.sparkr.BaseRRDD.getPartitions(RRDD.scala:31)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1511)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
        at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:312)
        at org.apache.spark.api.java.JavaRDD.collect(JavaRDD.scala:32)
        ... 25 more
Error: returnStatus == 0 is not TRUE

3) By using the same path my other Spark jobs written in Scala use: exactly the same error as in 2)

I am sure I am missing an obvious step. If anyone could help me with this, that would be great!

Thanks,

PS: I am 100% sure the GCS connector works with classic Scala jobs!

1 answer:

Answer 0: (score: 2)

Short Answer

You need core-site.xml, hdfs-site.xml, etc. on the classpath, along with gcs-connector-1.3.3-hadoop1.jar. Accomplish this with:

export YARN_CONF_DIR=/home/hadoop/hadoop-install/conf:/home/hadoop/hadoop-install/lib/gcs-connector-1.3.3-hadoop1.jar
./sparkR

You may also want other spark-env.sh settings; consider additionally running:

source /home/hadoop/spark-install/conf/spark-env.sh

before ./sparkR. If you call sparkR.init manually in R, this is less necessary, since you will pass parameters such as master directly; a sketch of such a manual call follows.
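
For illustration, a minimal manual initialization might look like the sketch below. The local[2] master and the executor memory value are taken from the sparkR.init examples quoted later in this answer; the sparkHome path assumes the bdutil default install location, so adjust it to your cluster.

library(SparkR)
# Assumed bdutil default Spark install path; change master/sparkHome as needed
sc <- sparkR.init(master = "local[2]",
                  appName = "SparkR",
                  sparkHome = "/home/hadoop/spark-install",
                  sparkEnvir = list(spark.executor.memory = "1g"))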

Other possible pitfalls:

  1. Make sure your default Java is Java 7. If it is Java 6, run sudo update-alternatives --config java and select Java 7 as the default.
  2. When building SparkR, be sure to set the Spark version: SPARK_VERSION=1.3.0 ./install-dev.sh
Long Answer

    通常,&#34; No FileSystem for scheme&#34;错误意味着我们需要确保core-site.xml在类路径上;修复类路径后遇到的第二个错误是&#34; java.lang.ClassNotFoundException:com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem&#34;这意味着我们还需要将gcs-connector-1.3.3.jar添加到类路径中。查看SparkR帮助程序脚本,主sparkR二进制文件使用以下内容调用sparkR.init

      sc <- sparkR.init(Sys.getenv("MASTER", unset = ""))
    

The MASTER environment variable is commonly set in spark-env.sh scripts, and indeed bdutil populates the MASTER environment variable in /home/hadoop/spark-install/conf/spark-env.sh. Normally this would suggest that simply adding source /home/hadoop/spark-install/conf/spark-env.sh is enough to populate the settings SparkR needs, but if we peek inside the sparkR.init definition, we see this:

    #' Initialize a new Spark Context.
    #'
    #' This function initializes a new SparkContext.
    #'
    #' @param master The Spark master URL.
    #' @param appName Application name to register with cluster manager
    #' @param sparkHome Spark Home directory
    #' @param sparkEnvir Named list of environment variables to set on worker nodes.
    #' @param sparkExecutorEnv Named list of environment variables to be used when launching executors.
    #' @param sparkJars Character string vector of jar files to pass to the worker nodes.
    #' @param sparkRLibDir The path where R is installed on the worker nodes.
    #' @param sparkRBackendPort The port to use for SparkR JVM Backend.
    #' @export
    #' @examples
    #'\dontrun{
    #' sc <- sparkR.init("local[2]", "SparkR", "/home/spark")
    #' sc <- sparkR.init("local[2]", "SparkR", "/home/spark",
    #'                  list(spark.executor.memory="1g"))
    #' sc <- sparkR.init("yarn-client", "SparkR", "/home/spark",
    #'                  list(spark.executor.memory="1g"),
    #'                  list(LD_LIBRARY_PATH="/directory of JVM libraries (libjvm.so) on workers/"),
    #'                  c("jarfile1.jar","jarfile2.jar"))
    #'}
    
    sparkR.init <- function(
      master = "",
      appName = "SparkR",
      sparkHome = Sys.getenv("SPARK_HOME"),
      sparkEnvir = list(),
      sparkExecutorEnv = list(),
      sparkJars = "",
      sparkRLibDir = "") {
    
      <...>
      cp <- paste0(jars, collapse = collapseChar)
    
      yarn_conf_dir <- Sys.getenv("YARN_CONF_DIR", "")
      if (yarn_conf_dir != "") {
        cp <- paste(cp, yarn_conf_dir, sep = ":")
      }
      <...>
    
        if (Sys.getenv("SPARKR_USE_SPARK_SUBMIT", "") == "") {
          launchBackend(classPath = cp,
                        mainClass = "edu.berkeley.cs.amplab.sparkr.SparkRBackend",
                        args = path,
                        javaOpts = paste("-Xmx", sparkMem, sep = ""))
        } else {
          # TODO: We should deprecate sparkJars and ask users to add it to the
          # command line (using --jars) which is picked up by SparkSubmit
          launchBackendSparkSubmit(
              mainClass = "edu.berkeley.cs.amplab.sparkr.SparkRBackend",
              args = path,
              appJar = .sparkREnv$assemblyJarPath,
              sparkHome = sparkHome,
              sparkSubmitOpts = Sys.getenv("SPARKR_SUBMIT_ARGS", ""))
        }
    

This tells us three things:

1. The default sparkR script cannot pass sparkJars, so there does not appear to be a current convenient way to pass libjars as a flag.
2. There is a TODO to deprecate the sparkJars parameter anyway.
3. Aside from the sparkJars parameter, the only other thing going into the cp/classPath argument is YARN_CONF_DIR (unless I am missing some other source of classpath additions, or am using a different version of SparkR than you). Fortunately, YARN_CONF_DIR appears to be used even if you are not planning to run on YARN.

In all, this shows you probably want at least the variables from /home/hadoop/spark-install/conf/spark-env.sh, since at least some of the hooks appear to look for environment variables commonly defined there, and secondly we should be able to hack YARN_CONF_DIR to specify a classpath that both finds core-site.xml and adds gcs-connector-1.3.3.jar to the classpath; one way to do that from within R is sketched below.
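
Since sparkR.init reads YARN_CONF_DIR via Sys.getenv (as the snippet above shows), an alternative to exporting the variable in the shell is, in principle, to set it from R before initializing. A minimal sketch, assuming the bdutil default paths:

# Assumed bdutil default conf dir and connector jar; adjust to your cluster
Sys.setenv(YARN_CONF_DIR = "/home/hadoop/hadoop-install/conf:/home/hadoop/hadoop-install/lib/gcs-connector-1.3.3-hadoop1.jar")
sc <- sparkR.init(Sys.getenv("MASTER", unset = ""))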

So, the answer to your question is:

export YARN_CONF_DIR=/home/hadoop/hadoop-install/conf:/home/hadoop/hadoop-install/lib/gcs-connector-1.3.3-hadoop1.jar
./sparkR

You may need to change the /home/hadoop/hadoop-install/lib/gcs-connector-1.3.3-hadoop1.jar part if you are using hadoop2 or some other gcs-connector version. That command fixes the HDFS access as well as finding the fs.gs.impl for the gcs-connector, and makes sure the actual gcs-connector jar is on the classpath. It does not pull in spark-env.sh, so you may find it defaulting to running with MASTER=local. Assuming your worker nodes also have SparkR properly installed, you may also want to run:

source /home/hadoop/spark-install/conf/spark-env.sh
export YARN_CONF_DIR=/home/hadoop/hadoop-install/conf:/home/hadoop/hadoop-install/lib/gcs-connector-1.3.3-hadoop1.jar
./sparkR
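
Once the classpath is set up this way, the read from the question should work; as a quick check (reusing the bucket path placeholder from the question):

file <- textFile(sc, "gs://xxxxx/dir/")
count(file)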

A couple of additional caveats based on what I encountered:

1. You may find your R installation is set up with an older Java version. If you run into something like "Unsupported major.minor version 51.0", run sudo update-alternatives --config java and make Java 7 the default.
2. If you are using Spark 1.3.0 together with SparkR's install-dev.sh, Spark may erroneously hang with "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory" when in fact the scheduler fast-fails with a serialVersionUID mismatch, which you can see in /hadoop/spark/logs/*Master*.out; the fix is to make sure you run install-dev.sh with the right Spark version set: SPARK_VERSION=1.3.0 ./install-dev.sh