Accessing WrappedArray elements

Time: 2017-06-10 00:25:58

Tags: python scala apache-spark pyspark

I have a Spark DataFrame; here is its schema:

|-- eid: long (nullable = true)
|-- age: long (nullable = true)
|-- sex: long (nullable = true)
|-- father: array (nullable = true)
|    |-- element: array (containsNull = true)
|    |    |-- element: long (containsNull = true)

and a sample of the rows:

df.select(df['father']).show()
+--------------------+
|              father|
+--------------------+
|[WrappedArray(-17...|
|[WrappedArray(-11...|
|[WrappedArray(13,...|
+--------------------+

The column's type is

DataFrame[father: array<array<bigint>>]

How can I access each element of the inner array, for example -17 in the first row? I tried different things such as df.select(df['father'])(0)(0).show(), but with no luck.

3 Answers:

Answer 0 (score: 4)

If I'm not mistaken, the syntax in Python is

df.select(df['father'][0][0]).show()

or

df.select(df['father'].getItem(0).getItem(0)).show()

See some examples here: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=column#pyspark.sql.Column
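For context, here is a minimal, self-contained PySpark sketch (assuming Spark 2.x with an active SparkSession; the sample data is hypothetical and only mimics the question's array<array<bigint>> schema) showing both forms of element access:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows shaped like the question's `father` column
df = spark.createDataFrame(
    [(1, [[-17, 2]]), (2, [[-11, 5]]), (3, [[13, 7]])],
    ["eid", "father"],
)

# Index into the outer array, then into the inner array
df.select(df["father"][0][0].alias("first")).show()

# Equivalent form using Column.getItem
df.select(col("father").getItem(0).getItem(0).alias("first")).show()

Both selects return the first element of the first inner array (-17, -11, 13 for the three sample rows).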

Answer 1 (score: 2)

The solution in Scala would be

import org.apache.spark.sql.functions._

// build a one-row DataFrame with an array<array<bigint>> "father" column from a JSON string
val data = sparkContext.parallelize("""{"eid":1,"age":30,"sex":1,"father":[[1,2]]}""" :: Nil)
val dataframe = sqlContext.read.json(data).toDF()

The resulting dataframe looks like

+---+---+---+--------------------+
|eid|age|sex|father              |
+---+---+---+--------------------+
|1  |30 |1  |[WrappedArray(1, 2)]|
+---+---+---+--------------------+

and the solution is then

dataframe.select(col("father")(0)(0) as("first"), col("father")(0)(1) as("second")).show(false)

with the output

+-----+------+
|first|second|
+-----+------+
|1    |2     |
+-----+------+

Answer 2 (score: 1)

Another Scala answer is as follows:

df.select(col("father").getItem(0) as "father_0", col("father").getItem(1) as "father_1")