Installing the spark-csv package for SparkR in Windows

Date: 2016-05-03 09:26:45

Tags: windows csv apache-spark cmd sparkr

I am new to the Spark world and would like to use SparkR to run machine learning algorithms.

I installed Spark (1.6.1) in standalone mode on my laptop (Windows 7, 64-bit), and I can run Spark, PySpark, and start SparkR in Windows by following this working guide: link. Once SparkR is up, I try the well-known flights example:

#Set proxy
Sys.setenv(http_proxy="http://user:password@proxy.companyname.es:8080/")
#Set SPARK_HOME
Sys.setenv(SPARK_HOME="C:/Users/amartinezsistac/spark-1.6.1-bin-hadoop2.4")
#Load SparkR and its library
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R", "lib"), .libPaths()))
library(SparkR)
#Set Spark Context and SQL Context
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
#Read Data
link <- "s3n://mortar-example-data/airline-data"
flights <- read.df(sqlContext, link, source = "com.databricks.spark.csv", header= "true")

However, after the last line I get the following error message:

Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
  java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
    at org.apache.spark.sql.api.r.SQLUtils$.loadDF(SQLUtils.scala:160)
    at org.apache.spark.sql.api.r.SQLUtils.loadDF(SQLUtils.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
    at org.apache.spark.api.r.RBackendHandler.ch

It seems the cause is that the spark-csv package is not installed; it can be downloaded from this page (GitHub link). As on Stack Overflow, the spark-packages.org site (link) suggests: $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0, which is for a Linux installation.

My question is: how do I run this line from the Windows 7 cmd in order to download the package?

I also tried an alternative solution for my error message (GitHub), without success:

#In master you don't need spark-csv. 
#CSV data source is built into SparkSQL. Just use it as follows:
flights <- read.df(sqlContext, "out/data.txt", source = "com.databricks.spark.csv", delimiter="\t", header="true", inferSchema="true")
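
Another workaround I have seen suggested is to request the package through the SPARKR_SUBMIT_ARGS environment variable before SparkR starts its JVM backend; a minimal sketch (my assumption, not something confirmed above, and note the _2.11 suffix has to match the Scala version your Spark distribution was built with):

#Must be set before library(SparkR) and sparkR.init(),
#so that spark-submit fetches spark-csv when the backend launches
Sys.setenv(SPARKR_SUBMIT_ARGS = '"--packages" "com.databricks:spark-csv_2.11:1.4.0" "sparkr-shell"')
library(SparkR)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)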

Thanks in advance, everyone.

1 Answer:

Answer 0: (score: 1)

The same works on Windows. When you launch spark-shell from the bin directory, start it like this:

spark-shell --packages com.databricks:spark-csv_2.11:1.4.0
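
Alternatively, the package coordinates can be passed from within the R session itself; a sketch, assuming sparkR.init in Spark 1.6 accepts the sparkPackages argument (again, the _2.11/_2.10 suffix should match the Scala build of your Spark distribution):

#Have spark-submit download spark-csv when the context is created
sc <- sparkR.init(master = "local", sparkPackages = "com.databricks:spark-csv_2.11:1.4.0")
sqlContext <- sparkRSQL.init(sc)
#read.df should then resolve the com.databricks.spark.csv source
#(link is the S3 path defined in the question)
flights <- read.df(sqlContext, link, source = "com.databricks.spark.csv", header = "true")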