Collect while preserving order

Date: 2019-10-10 07:42:36

Tags: apache-spark pyspark apache-spark-sql

I am referring to this question here, but that one is about collect_list rather than collect_set.

I have a dataframe like this:

data = [("ID1", 9),
        ("ID1", 9),
        ("ID1", 8),
        ("ID1", 7),
        ("ID1", 5),
        ("ID1", 5)]
df = spark.createDataFrame(data, ["ID", "Values"])
df.show()

+---+------+
| ID|Values|
+---+------+
|ID1|     9|
|ID1|     9|
|ID1|     8|
|ID1|     7|
|ID1|     5|
|ID1|     5|
+---+------+

I am trying to create a new column that collects these values as a set:

from pyspark.sql.functions import collect_set

df = df.groupBy('ID').agg(collect_set('Values').alias('Value_set'))
df.show()

+---+------------+
| ID|   Value_set|
+---+------------+
|ID1|[9, 5, 7, 8]|
+---+------------+

But the order is not maintained; my result should be [9, 8, 7, 5].

4 answers:

Answer 0 (score: 0)

From the pyspark source code, the docstring for collect_set:

_collect_set_doc = """
    Aggregate function: returns a set of objects with duplicate elements eliminated.

    .. note:: The function is non-deterministic because the order of collected results depends
        on order of rows which may be non-deterministic after a shuffle.

    >>> df2 = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
    >>> df2.agg(collect_set('age')).collect()
    [Row(collect_set(age)=[5, 2])]
    """

This means you get an unordered, hash-table-based collection; you can read more about the 'order' of unordered Python sets.
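To illustrate that point, here is a small plain-Python sketch (not part of the original answer): a set deduplicates but iterates in hash-table order, while dict keys keep insertion order.

vals = [9, 9, 8, 7, 5, 5]

# a plain set drops duplicates but iterates in hash order, not insertion order
print(list(set(vals)))            # e.g. [8, 9, 5, 7] -- depends on hashing

# dict keys preserve insertion order (Python 3.7+), keeping [9, 8, 7, 5]
print(list(dict.fromkeys(vals)))  # [9, 8, 7, 5]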

Answer 1 (score: 0)

If you are using Spark 2.4 or later, you can apply the array_sort() function to your column:
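A minimal sketch of that idea, assuming the df from the question (array_sort sorts ascending; sort_array with asc=False gives the descending [9, 8, 7, 5] the question asks for):

from pyspark.sql.functions import collect_set, array_sort, sort_array

result = df.groupBy("ID").agg(
    array_sort(collect_set("Values")).alias("Value_set_asc"),        # [5, 7, 8, 9]
    sort_array(collect_set("Values"), asc=False).alias("Value_set"), # [9, 8, 7, 5]
)
result.show()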

Answer 2 (score: 0)

I solved it like this:

from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import ArrayType, IntegerType

df = df.groupBy('ID').agg(collect_list('Values').alias('Values_List'))
df.show()

# deduplicate while keeping first-occurrence order
def my_function(x):
    return list(dict.fromkeys(x))

udf_set = udf(my_function, ArrayType(IntegerType()))
df = df.withColumn("Values_Set", udf_set("Values_List"))

df.show(truncate=False)

+---+------------------+------------+
|ID |Values_List       |Values_Set  |
+---+------------------+------------+
|ID1|[9, 9, 8, 7, 5, 5]|[9, 8, 7, 5]|
+---+------------------+------------+
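As an aside (not part of the original answer): on Spark 2.4+ the same first-occurrence deduplication can be done without a Python UDF via array_distinct, assuming the Values_List column built above.

from pyspark.sql.functions import array_distinct

# removes duplicates while keeping the order of first occurrence
df = df.withColumn("Values_Set", array_distinct("Values_List"))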

Answer 3 (score: 0)

If your data is fairly small, you can coalesce it to a single partition, sort it, and then use collect_set().

For example, with columns name and ind:

cook,3
jone,1
sam,7
zack,4
tim,2
singh,9
ambani,5
ram,8
jack,0
nike,6

df.coalesce(1).sort("ind").agg(collect_list("name").alias("names_list")).show()

names_list

[jack, jone, tim, cook, zack, ambani, nike, sam, ram, singh]
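A self-contained PySpark version of this idea (a sketch; the variable names and the (name, ind) columns are assumptions matching the example above):

from pyspark.sql.functions import collect_list

rows = [("cook", 3), ("jone", 1), ("sam", 7), ("zack", 4), ("tim", 2),
        ("singh", 9), ("ambani", 5), ("ram", 8), ("jack", 0), ("nike", 6)]
names_df = spark.createDataFrame(rows, ["name", "ind"])

# with a single partition, collect_list sees the rows in the sorted order
names_df.coalesce(1).sort("ind") \
    .agg(collect_list("name").alias("names_list")) \
    .show(truncate=False)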