SparkR + Cassandra query with complex data types

Asked: 2016-07-13 15:29:47

Tags: r apache-spark cassandra spark-dataframe sparkr

We have SparkR set up to connect to Cassandra, and we can successfully connect to and query Cassandra data. However, many of our Cassandra column families have columns with complex data types such as MapType, and we get errors when querying those types. Is there a way to coerce these columns before or during the query with SparkR? For example, a cqlsh query over the same data coerces the MapType column b shown below into a string such as "{38:262,97:21,98:470}".

Sys.setenv(SPARK_HOME = "/opt/spark")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

mySparkPackages <- "datastax:spark-cassandra-connector:1.6.0-s_2.10"
mySparkEnvironment <- list(
  spark.local.dir="...",
  spark.eventLog.dir="...",
  spark.cassandra.connection.host="...",
  spark.cassandra.auth.username="...",
  spark.cassandra.auth.password="...")

sc <- sparkR.init(master="...", sparkEnvir=mySparkEnvironment,sparkPackages=mySparkPackages)
sqlContext <- sparkRSQL.init(sc)

spark.df <- read.df(sqlContext,
                    source = "org.apache.spark.sql.cassandra",
                    keyspace = "...",
                    table = "...")

spark.df.sub <- subset(spark.df, (...), select = c(1,2))
schema(spark.df.sub)

StructType
|-name = "a", type = "IntegerType", nullable = TRUE
|-name = "b", type = "MapType(IntegerType,IntegerType,true)", nullable = TRUE

r.df.sub <- collect(spark.df.sub, stringsAsFactors = FALSE)
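To illustrate the kind of coercion being asked about, here is a minimal, untested sketch of one possible approach: register the DataFrame as a temp table and cast the MapType column to a string on the Spark side, so that collect() only has to transfer a plain character column to R. The table name cass_table is hypothetical, and it is not certain that CAST of a complex type to STRING is supported in Spark 1.6.

```r
# Hypothetical sketch, assuming the SparkR 1.6 API (registerTempTable, sql)
# and that Spark SQL in this version can cast a MapType column to STRING.
registerTempTable(spark.df, "cass_table")

# Cast the map column b to a string before it ever reaches R.
spark.df.str <- sql(sqlContext,
                    "SELECT a, CAST(b AS STRING) AS b FROM cass_table")

r.df.str <- collect(spark.df.str, stringsAsFactors = FALSE)
```

If the cast works, the collected column b would arrive as character values rather than map structures, sidestepping the deserialization step that fails below.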

Here is the error we get from collect():

16/07/13 12:13:50 INFO TaskSetManager: Finished task 1756.0 in stage 0.0     (TID 1756) in 1525 ms on ip-10-225-70-184.ec2.internal (1757/1758)
16/07/13 12:13:50 INFO TaskSetManager: Finished task 1755.0 in stage 0.0 (TID 1755) in 1661 ms on ip-10-225-70-184.ec2.internal (1758/1758)
16/07/13 12:13:50 INFO DAGScheduler: ResultStage 0 (dfToCols at NativeMethodAccessorImpl.java:-2) finished in 2587.670 s
16/07/13 12:13:50 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/07/13 12:13:50 INFO DAGScheduler: Job 0 finished: dfToCols at NativeMethodAccessorImpl.java:-2, took 2588.088830 s
16/07/13 12:13:51 ERROR RBackendHandler: dfToCols on org.apache.spark.sql.api.r.SQLUtils failed
Error in readBin(con, raw(), stringLen, endian = "big") : 
  invalid 'n' argument

Our stack:

Ubuntu 14.04.4 LTS Trusty Tahr

Cassandra v 2.1.14

Scala 2.10.6

Spark 1.6.2 with Hadoop libs 2.6

Spark-Cassandra Connector 1.6.0 for Scala 2.10

DataStax Cassandra Java driver v3.0 (actually v3.0.1)

Microsoft R Open, aka Revo R, version 3.2.5 with MKL

Rstudio服务器0.99.902

0 Answers:

There are no answers yet.