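We have SparkR set up to connect to Cassandra, and we can connect to and query Cassandra data successfully. However, many of our Cassandra column families have complex data types such as MapType, and querying those types produces errors. Is there a way to coerce these types before or during a SparkR query? For example, cqlsh commands against the same data coerce the MapType column b in the row below to a string such as "{38:262,97:21,98:470}".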
Sys.setenv(SPARK_HOME = "/opt/spark")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
mySparkPackages <- "datastax:spark-cassandra-connector:1.6.0-s_2.10"
mySparkEnvironment <- list(
  spark.local.dir = "...",
  spark.eventLog.dir = "...",
  spark.cassandra.connection.host = "...",
  spark.cassandra.auth.username = "...",
  spark.cassandra.auth.password = "...")
sc <- sparkR.init(master = "...", sparkEnvir = mySparkEnvironment, sparkPackages = mySparkPackages)
sqlContext <- sparkRSQL.init(sc)
spark.df <- read.df(sqlContext,
                    source = "org.apache.spark.sql.cassandra",
                    keyspace = "...",
                    table = "...")
spark.df.sub <- subset(spark.df, (...), select = c(1,2))
schema(spark.df.sub)
StructType
|-name = "a", type = "IntegerType", nullable = TRUE
|-name = "b", type = "MapType(IntegerType,IntegerType,true)", nullable = TRUE
r.df.sub <- collect(spark.df.sub, stringsAsFactors = FALSE)
Here we get this error from collect():
16/07/13 12:13:50 INFO TaskSetManager: Finished task 1756.0 in stage 0.0 (TID 1756) in 1525 ms on ip-10-225-70-184.ec2.internal (1757/1758)
16/07/13 12:13:50 INFO TaskSetManager: Finished task 1755.0 in stage 0.0 (TID 1755) in 1661 ms on ip-10-225-70-184.ec2.internal (1758/1758)
16/07/13 12:13:50 INFO DAGScheduler: ResultStage 0 (dfToCols at NativeMethodAccessorImpl.java:-2) finished in 2587.670 s
16/07/13 12:13:50 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/07/13 12:13:50 INFO DAGScheduler: Job 0 finished: dfToCols at NativeMethodAccessorImpl.java:-2, took 2588.088830 s
16/07/13 12:13:51 ERROR RBackendHandler: dfToCols on org.apache.spark.sql.api.r.SQLUtils failed
Error in readBin(con, raw(), stringLen, endian = "big") :
invalid 'n' argument
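To make the target format concrete, here is a minimal sketch in plain R, independent of Spark and Cassandra, of the string form we would like column b to arrive in. The helper names `map_to_string` and `string_to_map` are hypothetical (they are not part of SparkR or the connector), and the sketch assumes integer keys and values as in the schema above. It does not solve the collect() failure; it only shows the cqlsh-style representation and how it could be turned back into a named vector on the R side.

```r
# Hypothetical helpers, not part of SparkR or the Spark-Cassandra connector.
# They convert between an R named integer vector and a cqlsh-style map string.

# Build a string such as "{38:262,97:21,98:470}" from a named integer vector.
map_to_string <- function(m) {
  if (length(m) == 0) return("{}")
  paste0("{", paste(names(m), unname(m), sep = ":", collapse = ","), "}")
}

# Parse a cqlsh-style map string back into a named integer vector.
string_to_map <- function(s) {
  # Drop whitespace, then the surrounding braces.
  inner <- gsub("^\\{|\\}$", "", gsub("\\s", "", s))
  if (!nzchar(inner)) return(setNames(integer(0), character(0)))
  # Split into "key:value" entries, then split each entry on the colon.
  parts <- strsplit(strsplit(inner, ",", fixed = TRUE)[[1]], ":", fixed = TRUE)
  keys <- vapply(parts, `[`, character(1), 1)
  vals <- vapply(parts, function(p) as.integer(p[2]), integer(1))
  setNames(vals, keys)
}

# Example value shaped like column b in the question.
b <- c("38" = 262L, "97" = 21L, "98" = 470L)
s <- map_to_string(b)
print(s)
```

If the map column could be delivered to R as a string like this, `string_to_map(s)` would recover the key/value pairs after collect().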
Ubuntu 14.04.4 LTS Trusty Tahr
Cassandra v 2.1.14
Scala 2.10.6
Spark 1.6.2 with Hadoop libs 2.6
Spark-Cassandra connector 1.6.0 for Scala 2.10
DataStax Cassandra Java driver v3.0 (actually v3.0.1)
Microsoft R Open (a.k.a. Revo R) version 3.2.5 with MTL
RStudio Server 0.99.902