I have a Hive table containing a column (c4) of type array<bigint>. Now I want to extract this column with Spark. Here is the code snippet:
val query = """select c1, c2, c3, c4 from
some_table where some_condition"""
val rddHive = hiveContext.sql(query).rdd.map { row =>
  // Is there any other way to extract c4? Using String here seems not to work,
  // yet there is no compile error and no runtime error.
  val w = if (row.isNullAt(3)) List() else row.getAs[scala.collection.mutable.WrappedArray[String]]("c4").toList
  w
}
-> rddHive: org.apache.spark.rdd.RDD[List[String]] = MapPartitionsRDD[7] at map at <console>:32
rddHive.map(x => x(0).getClass.getSimpleName).take(1)
-> Array[String] = Array(Long)
So I used getAs[scala.collection.mutable.WrappedArray[String]] to extract c4, whose actual type is array<bigint>, and yet there was no compile error and no runtime error; the data I extracted is still of type bigint (Long). So what is happening here (why is there neither a compiler error nor a runtime error), and what is the correct way to extract an array<bigint> column as List[String] in Spark?
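My current guess is that this is JVM type erasure at work: Row.getAs[T] is essentially an asInstanceOf[T] cast, and the String element type of WrappedArray[String] is erased at runtime, so the cast succeeds and the elements remain Longs. A minimal sketch outside Spark (ErasureDemo is just an illustrative name) showing the same effect:

object ErasureDemo extends App {
  // Element types of generic containers are erased on the JVM, so this cast
  // only checks that the value is a Seq, not that its elements are Strings.
  val longs: Any = Seq(1L, 2L, 3L)
  val pretendStrings = longs.asInstanceOf[Seq[String]] // no error here

  // getClass is defined on Any, so the compiler inserts no cast to String:
  println(pretendStrings.head.getClass.getSimpleName) // prints "Long"

  // A genuine String operation would force the checkcast and fail:
  // pretendStrings.head.toUpperCase // ClassCastException at this use site
}

If that is right, it would also explain why x(0).getClass.getSimpleName above reports Long without throwing.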
================== Added more information ====================
hiveContext.sql(query).printSchema
root
|-- c1: string (nullable = true)
|-- c2: integer (nullable = true)
|-- c3: string (nullable = true)
|-- c4: array (nullable = true)
| |-- element: long (containsNull = true)
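For reference, the element type can also be checked programmatically rather than by reading printSchema output; a small sketch against the same hiveContext and query:

import org.apache.spark.sql.types.{ArrayType, LongType}

val df = hiveContext.sql(query)
df.schema("c4").dataType match {
  case ArrayType(LongType, _) => println("c4 elements are bigint (LongType)")
  case other                  => println(s"unexpected type for c4: $other")
}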
hiveContext.sql(query).show(3)
+--------+----+----------------+--------------------+
| c1| c2| c3| c4|
+--------+----+----------------+--------------------+
| c1111| 1|5511798399.22222|[21772244666, 111...|
| c1112| 1|5511798399.88888|[11111111, 111111...|
| c1113| 2| 5555117114.3333|[77777777777, 112...|
+--------+----+----------------+--------------------+
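Given this schema, one approach that should work (a sketch, assuming the same hiveContext and query as above) is to read c4 with its actual element type and convert explicitly:

val rddStrings = hiveContext.sql(query).rdd.map { row =>
  if (row.isNullAt(3)) List.empty[String]
  else row.getSeq[Long](3).map(_.toString).toList // Seq[Long] -> List[String]
}

With this, rddStrings is an RDD[List[String]] whose elements really are Strings, so x(0).getClass.getSimpleName should report String instead of Long.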