I'm trying to create a new column in a dataframe that is simply a shuffled version of an existing column. I can randomly order the rows of a dataframe using the method described in How to shuffle the rows in a Spark dataframe?, but when I try to add the shuffled version of the column to the dataframe, the shuffle doesn't seem to happen.
import pyspark
import pyspark.sql.functions as F
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.range(5).toDF("x")
df.show()
#> +---+
#> | x|
#> +---+
#> | 0|
#> | 1|
#> | 2|
#> | 3|
#> | 4|
#> +---+
# the rows appear to be shuffled
ordered_df = df.orderBy(F.rand())
ordered_df.show()
#> +---+
#> | x|
#> +---+
#> | 0|
#> | 2|
#> | 3|
#> | 4|
#> | 1|
#> +---+
# ...but when I try to add this column to the df, they are no longer shuffled
df.withColumn('y', ordered_df.x).show()
#> +---+---+
#> | x| y|
#> +---+---+
#> | 0| 0|
#> | 1| 1|
#> | 2| 2|
#> | 3| 3|
#> | 4| 4|
#> +---+---+
Created on 2019-06-28
Some notes: the same thing happens when the dataframe is built via parallelize and the shuffle is attempted inline:
df = spark.sparkContext.parallelize(range(5)).map(lambda x: (x, )).toDF(["x"])
df.withColumn('y', df.orderBy(F.rand()).x).show()
#> +---+---+
#> | x| y|
#> +---+---+
#> | 0| 0|
#> | 1| 1|
#> | 2| 2|
#> | 3| 3|
#> | 4| 4|
#> +---+---+
I'd like to avoid a zipWithIndex()-based solution, since that approach would require running many joins over the data (which I expect would be time-consuming).
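For reference, a minimal sketch of the kind of zipWithIndex()-based approach meant here, assuming the idea is to index both the original and a shuffled copy and join on the index (this is an illustration, not code from the linked answer):
# assign a stable index to the original rows via the RDD API
indexed = df.rdd.zipWithIndex().map(lambda t: (t[0][0], t[1])).toDF(["x", "idx"])
# shuffle a copy of the column and index it the same way
shuffled = (df.orderBy(F.rand())
              .rdd.zipWithIndex()
              .map(lambda t: (t[0][0], t[1]))
              .toDF(["y", "idx"]))
# join on the index to pair each original value with a shuffled one
indexed.join(shuffled, "idx").drop("idx").show()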
Answer 0 (score: 1)
You can use a window function to assign a random index to each row, do the same again in a separate DF, and then join on the index:
>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as F
>>> df = spark.range(5).toDF("x")
>>> left = df.withColumn("rnd", F.row_number().over(Window.orderBy(F.rand())))
>>> right = df.withColumnRenamed("x", "y").withColumn("rnd", F.row_number().over(Window.orderBy(F.rand())))
>>> dff = left.join(right, left.rnd == right.rnd).drop("rnd")
>>> dff.show()
19/06/29 13:17:04 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
19/06/29 13:17:04 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+---+
| x| y|
+---+---+
| 3| 3|
| 2| 0|
| 0| 2|
| 1| 1|
| 4| 4|
+---+---+
As the warnings indicate, this may not be a good idea in practice.
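If the data is small enough to collect, one workaround (a sketch, assuming the column fits in driver memory; not part of the answer above) is to shuffle locally and rebuild the dataframe, which avoids the single-partition window entirely:
import random
# collect the column, shuffle it on the driver, and rebuild the dataframe
rows = df.select("x").collect()
ys = [r.x for r in rows]
random.shuffle(ys)
dff = spark.createDataFrame(list(zip([r.x for r in rows], ys)), ["x", "y"])
dff.show()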