spark-warehouse error in R

Asked: 2016-10-18 14:30:20

Tags: r sparkr

I have installed Spark (spark-2.0.0-bin-hadoop2.7) on a Windows 10 PC and want to use the SparkR package in R. However, when I run the following example code:

library(SparkR)

# Initialize SparkSession
sparkR.session(appName = "SparkR-DataFrame-example")

# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))

# Convert local data frame to a SparkDataFrame
df <- createDataFrame(localDF)

it throws this exception:

Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/Users/Louagyd/Desktop/EDU%20%202016-2017/Data%20Analysis/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)     
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)    
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)     
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)     
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)   
at org.apache.spark.sql.internal.SessionState.analyzer(Session

Any ideas on how to fix this?

1 answer:

Answer 0: (score: 0)

I ran into the same error, and nothing I found online helped. However, I solved it with the following steps:

Prerequisites

  1. Download winutils.exe from here and install it.
  2. Create a folder named "C:\tmp\hive". This folder will be used as the warehouse directory.
  3. Run winutils.exe chmod 777 \tmp\hive from a command prompt (cmd). Make sure winutils.exe is on your system PATH; if not, add it to your environment variables. (A short R sketch of steps 2 and 3 follows this list.)
  4. Make sure Spark is installed on your system. In my case, it was installed under the "C:/spark-2.0.0-bin-hadoop2.7" folder.
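
If you prefer to do steps 2 and 3 from within R instead of a command prompt, a minimal sketch might look like the following (it assumes winutils.exe is already on your PATH, or that you substitute its full path):

  # Step 2: create the folder that will serve as the Hive scratch/warehouse directory
  dir.create("C:/tmp/hive", recursive = TRUE, showWarnings = FALSE)
  # Step 3: relax its permissions with winutils (assumes winutils.exe is reachable on the PATH)
  system("winutils.exe chmod 777 C:\\tmp\\hive")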
Main steps

    1. After opening RStudio, create a new project in any directory (say, "C:/home/Project/SparkR").
    2. In RStudio's script window, run the following commands in this order:

      # Set Working Dir - The same folder under which R Project was created
      setwd("C:/home/Project/SparkR")
      
      # Set the SPARK_HOME environment variable, if not already set.
      # If this variable is already set in the Windows environment variables, this step is not required.
      if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
        Sys.setenv(SPARK_HOME = "C:/spark-2.0.0-bin-hadoop2.7")
      }
      
      # Load SparkR library
      library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
      
      # Create a config list setting the driver memory to allocate and the warehouse directory to use at runtime
      sparkConf <- list(spark.driver.memory = "2g", spark.sql.warehouse.dir = "C:/tmp")
      # Create SparkR Session variable
      sparkR.session(master = "local[*]", sparkConfig = sparkConf)
      
      # Load existing data from SparkR library
      DF <- as.DataFrame(faithful)
      # Inspect loaded data
      head(DF)
      
    3. With the above steps, I was able to load the data and view it successfully.
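
For reference, once the session has been created with spark.sql.warehouse.dir pointing at a simple, space-free path as above, the example from the question should also work in that same session. A small sketch (reusing the session started in step 2):

      # Recreate the local data frame from the question and convert it to a SparkDataFrame
      localDF <- data.frame(name = c("John", "Smith", "Sarah"), age = c(19, 23, 18))
      df <- createDataFrame(localDF)
      # Inspect the result
      head(df)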