How to read a csv into sparkR (ver 1.4)?

Date: 2015-07-03 10:50:40

Tags: r csv apache-spark apache-spark-sql sparkr

With the recent release of the new version of Spark (1.4), there seems to be a nice frontend interface to Spark from the R package named sparkR. On the documentation page of R for Spark there is a command that enables reading json files as RDD objects:

people <- read.df(sqlContext, "./examples/src/main/resources/people.json", "json")

I am trying to read data from a .csv file in the way described in this Revolution Analytics blog post:

# Download the nyc flights dataset as a CSV from https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv

# Launch SparkR using 
# ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3

# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1

# Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")

The instructions say that I need the spark-csv package to enable this operation, so I downloaded the package from the github repo with the following command:

$ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3

But then I got the following error while trying to read the .csv file:

> flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")
15/07/03 12:52:41 ERROR RBackendHandler: load on 1 failed
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
    at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
    at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:216)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:229)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
    at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1230)
    ... 25 more
Error: returnStatus == 0 is not TRUE

Any idea what this error means and how to work around the problem?

Of course, I could read the .csv file in the standard R way, e.g.:

read.table("data.csv") -> flights

and then convert the R data.frame into a Spark DataFrame like this:

flights_df <- createDataFrame(sqlContext, flights)

but that is not the way I would prefer, and it is really time-consuming.

3 Answers:

Answer 0 (score: 13)

You have to launch the sparkR console with the package every time:

sparkR --packages com.databricks:spark-csv_2.10:1.0.3
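
With the console launched this way, the read.df call from the question should go through; a minimal sketch, assuming nycflights13.csv sits in the current working directory:

# sqlContext is created automatically by the sparkR shell
flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header = "true")
head(flights)  # inspect the first few rows to confirm the CSV was parsed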

Answer 1 (score: 5)

If you are using RStudio:

 library(SparkR)
 Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
 sc <- sparkR.init()  # the env var must be set before the Spark context is created
 sqlContext <- sparkRSQL.init(sc)

does the trick. Make sure that the spark-csv version you specify matches the one you downloaded.
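
Note that the ordering matters: SPARKR_SUBMIT_ARGS is read when sparkR.init() launches the backend JVM, so setting it after initialization has no effect. A quick sanity check (plain R, nothing SparkR-specific):

Sys.getenv('SPARKR_SUBMIT_ARGS')  # should show the --packages arguments before sparkR.init() is called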

Answer 2 (score: -1)

Make sure you install SparkR from your Spark distribution, using the following command:

install.packages("C:/spark/R/lib/sparkr.zip", repos = NULL)

rather than from github.

That solved the problem for me.
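
If installing the zip is not an option, a possible alternative (a sketch assuming the standard Spark 1.4 layout, where the unpacked SparkR package lives under R/lib of the distribution; the path below is illustrative) is to point library() at that directory:

# Load SparkR directly from the Spark distribution's R library directory
library(SparkR, lib.loc = "C:/spark/R/lib")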