I'm trying to create a new column in a dataframe that is simply a shuffled version of an existing column. I can randomly order the rows of a dataframe using the method described in How to shuffle the rows in a Spark dataframe?, but when I try to add the shuffled version of the column to the dataframe, the shuffle doesn't seem to happen.
import pyspark
import pyspark.sql.functions as F
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.range(5).toDF("x")
df.show()
#> +---+
#> | x|
#> +---+
#> | 0|
#> | 1|
#> | 2|
#> | 3|
#> | 4|
#> +---+
# the rows appear to be shuffled
ordered_df = df.orderBy(F.rand())
ordered_df.show()
#> +---+
#> | x|
#> +---+
#> | 0|
#> | 2|
#> | 3|
#> | 4|
#> | 1|
#> +---+
# ...but when I try to add this column to the df, they are no longer shuffled
df.withColumn('y', ordered_df.x).show()
#> +---+---+
#> | x| y|
#> +---+---+
#> | 0| 0|
#> | 1| 1|
#> | 2| 2|
#> | 3| 3|
#> | 4| 4|
#> +---+---+
Created on 2019-06-28
Some notes: the same thing happens when the dataframe is built via parallelize and the shuffle is attempted inline:
df = spark.sparkContext.parallelize(range(5)).map(lambda x: (x, )).toDF(["x"])
df.withColumn('y', df.orderBy(F.rand()).x).show()
#> +---+---+
#> | x| y|
#> +---+---+
#> | 0| 0|
#> | 1| 1|
#> | 2| 2|
#> | 3| 3|
#> | 4| 4|
#> +---+---+
I'd like to avoid a zipWithIndex()-based solution, since that approach would require running many joins over the data (which I expect would be time-consuming).
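For reference, a minimal sketch of the kind of zipWithIndex()-based approach meant here, assuming the idea is to index both the original and a shuffled copy and join on the index (this is an illustration, not code from the linked answer):
# assign a stable index to the original rows via the RDD API
indexed = df.rdd.zipWithIndex().map(lambda t: (t[0][0], t[1])).toDF(["x", "idx"])
# shuffle a copy of the column and index it the same way
shuffled = (df.orderBy(F.rand())
              .rdd.zipWithIndex()
              .map(lambda t: (t[0][0], t[1]))
              .toDF(["y", "idx"]))
# join on the index to pair each original value with a shuffled one
indexed.join(shuffled, "idx").drop("idx").show()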
Answer 0 (score: 1)
You can use a window function to assign a random index to each row, do the same again in a separate DF, and then join on the index:
>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as F
>>> df = spark.range(5).toDF("x")
>>> left = df.withColumn("rnd", F.row_number().over(Window.orderBy(F.rand())))
>>> right = df.withColumnRenamed("x", "y").withColumn("rnd", F.row_number().over(Window.orderBy(F.rand())))
>>> dff = left.join(right, left.rnd == right.rnd).drop("rnd")
>>> dff.show()
19/06/29 13:17:04 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
19/06/29 13:17:04 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+---+
| x| y|
+---+---+
| 3| 3|
| 2| 0|
| 0| 2|
| 1| 1|
| 4| 4|
+---+---+
As the warnings indicate, this may not be a good idea in practice.
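If the data is small enough to collect, one workaround (a sketch, assuming the column fits in driver memory; not part of the answer above) is to shuffle locally and rebuild the dataframe, which avoids the single-partition window entirely:
import random
# collect the column, shuffle it on the driver, and rebuild the dataframe
rows = df.select("x").collect()
ys = [r.x for r in rows]
random.shuffle(ys)
dff = spark.createDataFrame(list(zip([r.x for r in rows], ys)), ["x", "y"])
dff.show()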