I am grouping by a few columns and, as you can see in the schema, getting WrappedArray out of those columns. How do I get rid of them so I can proceed to the next step and do an orderBy?
val sqlDF = spark.sql("SELECT * FROM parquet.`parquet/20171009121227/rels/*.parquet`")
Getting a DataFrame:
val final_df = groupedBy_DF.select(
  groupedBy_DF("collect_list(relev)").as("rel"),
  groupedBy_DF("collect_list(relev2)").as("rel2"))
Then printing the schema (final_df.printSchema) gives us:
|-- rel: array (nullable = true)
| |-- element: double (containsNull = true)
|-- rel2: array (nullable = true)
| |-- element: double (containsNull = true)
Sample of the current output (shown as an image in the original post):
I am trying to convert to this:
|-- rel: double (nullable = true)
|-- rel2: double (nullable = true)
Desired example output (as in the image above):
-1.0,0.0
-1.0,0.0
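
For context, a minimal sketch of how a DataFrame like groupedBy_DF can arise; the input column names (id, relev, relev2) are assumptions based on the snippets above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// hypothetical input; "id" is an assumed grouping column
val df = Seq(("a", -1.0, 0.0), ("b", -1.0, 0.0)).toDF("id", "relev", "relev2")

// collect_list gathers each group's values into an array,
// which surfaces as WrappedArray when rows are collected
val groupedBy_DF = df.groupBy("id")
  .agg(collect_list("relev"), collect_list("relev2"))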
Answer 0 (score: 1)
Try col(x).getItem:
import org.apache.spark.sql.functions.col

groupedBy_DF.select(
  groupedBy_DF("collect_list(relev)").as("rel"),
  groupedBy_DF("collect_list(relev2)").as("rel2")
).withColumn("rel_0", col("rel").getItem(0))
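
Building on that, a sketch of how to flatten both columns and then sort, assuming each array holds exactly one element:

val flat_df = groupedBy_DF.select(
  groupedBy_DF("collect_list(relev)").as("rel"),
  groupedBy_DF("collect_list(relev2)").as("rel2"))
  .withColumn("rel", col("rel").getItem(0))
  .withColumn("rel2", col("rel2").getItem(0))

// rel and rel2 are now plain doubles, so orderBy works directly
flat_df.orderBy(col("rel")).show()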
Answer 1 (score: 1)
If collect_list always returns only a single value, use first instead. Then there is no need to deal with arrays at all. Note that this should be done in the groupBy step.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// the grouping columns are elided in the original answer
val final_df = df.groupBy(...)
  .agg(first($"relev").as("rel"),
       first($"relev2").as("rel2"))
Answer 2 (score: 0)
Try split:
import org.apache.spark.sql.functions._

val final_df = groupedBy_DF.select(
  groupedBy_DF("collect_list(relev)").as("rel"),
  groupedBy_DF("collect_list(relev2)").as("rel2"))
  // note: split takes a Column, not a column name, and it expects a string
  // column; rel here is an array of doubles, so it would need to be
  // converted to a string first for this to work
  .withColumn("rel", split(col("rel"), ","))