Order of column values

Time: 2019-07-02 13:10:18

Tags: pyspark apache-spark-sql

Below is my DataFrame:

Ref °     | indice_1 | Indice_2      | rank_1    |   rank_2   |  echelon_from     |    section_from      |      echelon_to    |  section_to 
--------------------------------------------------------------------------------------------------------------------------------------------
70574931  |   19     |   37.1        |  32       |    62      |  ["10032,20032"]  |   ["11/12","13"]     |      ["40062"]     |   ["14A"]
---------------------------------------------------------------------------------------------------------------------------------------------
70574931  |   18     |   36          |  32       |    62      |     ["20032"]     |      ["13"]          |    ["30062,40062"] |  ["14,14A"]

I would like to join the rows that share the same reference number, concatenating the echelon_from, section_from, echelon_to and section_to values (keeping values that are repeated between them only once), as in the example below, without touching the rest of the columns.

Ref °     | Indice_1 | Indice_2      | rank_1    |   rank_2   |  echelon_from     |    section_from      |      echelon_to    |  section_to  
---------------------------------------------------------------------------------------------------------------------------------------------
70574931  |   19     |   37.1        |  32       |    62      |  ["10032,20032"]  |   ["11/12","13"]     |     ["30062,40062"] |  ["14,14A"]
----------------------------------------------------------------------------------------------------------------------------------------------
70574931  |   18     |   36          |  32       |    62      |  ["10032,20032"]  |   ["11/12","13"]     |    ["30062,40062"] |  ["14,14A"]

Some of the column values in the original DataFrame are duplicated; I should not touch them and should keep the values that are already there, so that the DataFrame keeps the same number of rows. Can someone help me with how to do this?

Thank you!

1 answer:

Answer 0: (score: 0)

There are multiple ways to do this. One way is to explode all the given lists and then collect them back as sets:


from pyspark.sql import functions as F

# Columns whose list values should be merged; the rest are left untouched
lists_to_concat = ['echelon_from', 'section_from', 'echelon_to', 'section_to']
columns_not_to_concat = [c for c in df.columns if c not in lists_to_concat]

# Explode each list column so every element gets its own row
for c in lists_to_concat:
    df = df.withColumn(c, F.explode(c))

# Group back on the untouched columns and re-collect the values as sets
df = (
    df
    .groupBy(*columns_not_to_concat)
    .agg(
        *[F.collect_set(c).alias(c) for c in lists_to_concat]
    )
)

Another, more elegant, way is to use flatten().

Reference: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.flatten
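
For illustration, here is a minimal, self-contained sketch of the flatten() approach (Spark 2.4+). The column names, the array-of-string types, and the join back onto the original rows (to keep the same row count, as asked in the question) are assumptions, not part of the original answer; the reference column is simply called "Ref" here.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Rebuild an approximation of the example DataFrame from the question
# (column names and array-of-string types are assumed)
df = spark.createDataFrame(
    [
        (70574931, 19, 37.1, 32, 62, ["10032", "20032"], ["11/12", "13"], ["40062"], ["14A"]),
        (70574931, 18, 36.0, 32, 62, ["20032"], ["13"], ["30062", "40062"], ["14", "14A"]),
    ],
    ["Ref", "Indice_1", "Indice_2", "rank_1", "rank_2",
     "echelon_from", "section_from", "echelon_to", "section_to"],
)

lists_to_concat = ["echelon_from", "section_from", "echelon_to", "section_to"]

# One row per Ref holding the union of every list: collect_list gives an
# array of arrays, flatten merges them, array_distinct drops duplicates
merged = (
    df.groupBy("Ref")
      .agg(*[F.array_distinct(F.flatten(F.collect_list(c))).alias(c)
             for c in lists_to_concat])
)

# Join the merged lists back onto the original rows so the row count is unchanged
result = df.drop(*lists_to_concat).join(merged, on="Ref", how="left")
result.show(truncate=False)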