My dataframe has 3 columns, named id, feat1, and feat2. feat1 and feat2 are arrays of strings:
id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]
I want to get the list of distinct elements in each feature column, so the output would be:
distinct_feat1,distinct_feat2
-----------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3]
What is the best way to do this in Scala?
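For reference, a minimal spark-shell sketch that builds this example DataFrame (not part of the original question; assumes Spark 2.x with spark.implicits._ available):

import spark.implicits._

// Hypothetical setup for the example data above; feat1/feat2 become array<string> columns.
val df = Seq(
  (1, Seq("feat1_1", "feat1_2", "feat1_3"), Seq.empty[String]),
  (2, Seq("feat1_2"), Seq("feat2_1", "feat2_2")),
  (3, Seq("feat1_4"), Seq("feat2_3"))
).toDF("id", "feat1", "feat2")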
Answer 0 (score: 2):
You can use collect_set to find the distinct values of each column after applying the explode function to unnest the array elements in each cell. Suppose your dataframe is called df:
import org.apache.spark.sql.functions._

// Explode each array column, then collect each column's distinct values.
val distinct_df = df
  .withColumn("feat1", explode(col("feat1")))
  .withColumn("feat2", explode(col("feat2")))
  .agg(collect_set("feat1").alias("distinct_feat1"),
       collect_set("feat2").alias("distinct_feat2"))
distinct_df.show
+--------------------+--------------------+
| distinct_feat1| distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+
distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
WrappedArray(, feat2_1, feat2_2, feat2_3)])
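One caveat, not raised in the original answer: explode drops rows whose array is empty or null, so with the question's data exactly as written (feat2 = [] for id 1), that row's feat1 values would be lost as well. (The leading empty string in the output above suggests the test data used [""] rather than [].) A hedged sketch using explode_outer instead, available since Spark 2.2, which keeps such rows:

// explode_outer emits a null element for an empty/null array instead of
// dropping the row; collect_set skips nulls, so no values are lost.
val distinct_df2 = df
  .withColumn("feat1", explode_outer(col("feat1")))
  .withColumn("feat2", explode_outer(col("feat2")))
  .agg(collect_set("feat1").alias("distinct_feat1"),
       collect_set("feat2").alias("distinct_feat2"))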
Answer 1 (score: 0):
The method Psidom provided works great; here is a function that does the same thing given a DataFrame and a list of fields:
def array_unique_values(df, fields):
    from pyspark.sql.functions import col, collect_set, explode
    from functools import reduce
    # Explode each listed array column, then collect distinct values per column.
    data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
    return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])
Then:
data = array_unique_values(df, my_fields)
data.take(1)
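For completeness, a hedged Scala sketch of the same generic helper (the name and foldLeft structure are illustrative, not from either answer):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_set, explode}

// Explode each listed array column in turn, then collect the distinct
// values of every column; assumes fields is non-empty.
def arrayUniqueValues(df: DataFrame, fields: Seq[String]): DataFrame = {
  val exploded = fields.foldLeft(df)((d, f) => d.withColumn(f, explode(col(f))))
  exploded.agg(
    collect_set(col(fields.head)).alias(fields.head + "_distinct"),
    fields.tail.map(f => collect_set(col(f)).alias(f + "_distinct")): _*
  )
}

val data = arrayUniqueValues(df, Seq("feat1", "feat2"))
data.take(1)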