Writing columns of array&lt;double&gt; to a Hive table with Spark

Asked: 2016-07-19 09:02:00

Tags: scala apache-spark apache-spark-sql

Using Spark 1.6, I am trying to save arrays to a Hive table myTable consisting of two columns, each of type array&lt;double&gt;:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

val x = Array(1.0,2.0,3.0,4.0)
val y = Array(-1.0,-2.0,-3.0,-4.0)

val mySeq = Seq(x,y)
val df = sc.parallelize(mySeq).toDF("x","y")
df.write.insertInto("myTable")

But then I get the message:

error: value toDF is not a member of org.apache.spark.rdd.RDD[Array[Double]]
              val df = sc.parallelize(mySeq).toDF("x","y")

What is the correct way to accomplish this simple task?

1 answer:

Answer 0 (score: 0):

I assume the actual structure you want is:

X   | Y
1.0 | -1.0
2.0 | -2.0
3.0 | -3.0
4.0 | -4.0
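Note that insertInto requires myTable to exist beforehand. A minimal sketch of a matching table definition, assuming the table has not been created yet (this DDL is hypothetical, not part of the original question):

// Hypothetical: create the target table so insertInto has somewhere to write.
// Column types match the per-element layout assumed above.
sqlContext.sql("CREATE TABLE IF NOT EXISTS myTable (x DOUBLE, y DOUBLE)")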

For that, the code you want is:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

val x = Array(1.0,2.0,3.0,4.0)
val y = Array(-1.0,-2.0,-3.0,-4.0)

val mySeq = x.zip(y)  // pairs up elements: Array[(Double, Double)]
val df = sc.parallelize(mySeq).toDF("x","y")
df.write.insertInto("myTable")
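Note that with this zip-based approach the table ends up with two plain double columns and one row per array element. If the goal really is two array&lt;double&gt; columns, each holding a whole array in a single row, a minimal sketch (assuming myTable is declared with array&lt;double&gt; columns; the variable name arrayDF is mine) is to parallelize a one-element sequence containing a tuple of the two arrays:

// One row, two array<double> columns: wrap the pair of arrays in a Seq.
// toDF works here because (Array[Double], Array[Double]) is a tuple, i.e. a Product.
val arrayDF = sc.parallelize(Seq((x, y))).toDF("x", "y")
arrayDF.write.insertInto("myTable")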

Basically, you need a collection of row objects (i.e. an Array[Row]); the original code fails because toDF (brought in by the implicits import) is only defined for RDDs of Product types such as tuples and case classes, not for RDD[Array[Double]]. It is better to use a case class, as mentioned in another comment, rather than just a tuple.
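A minimal sketch of that case-class variant, assuming the same per-element layout (the class name Point and variable name pointsDF are hypothetical):

// A case class names and types the columns explicitly.
case class Point(x: Double, y: Double)

// toDF() infers the column names "x" and "y" from the case class fields.
val pointsDF = sc.parallelize(x.zip(y).map { case (a, b) => Point(a, b) }).toDF()
pointsDF.write.insertInto("myTable")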