Question

I'm using PySpark (which is new to me). Suppose I have the following table:

+-------+-------+----------+
| Col1  | Col2  | Question |
+-------+-------+----------+
| val11 | val12 | q1       |
| val21 | val22 | q2       |
| val31 | val32 | q3       |
+-------+-------+----------+

and I want to append a new column, random_question, which is in fact a permutation of the values in the Question column, so the result might look like this:

+-------+-------+----------+-----------------+
| Col1  | Col2  | Question | random_question |
+-------+-------+----------+-----------------+
| val11 | val12 | q1       | q2              |
| val21 | val22 | q2       | q3              |
| val31 | val32 | q3       | q1              |
+-------+-------+----------+-----------------+

I tried to do this as follows:

df.withColumn(
    'random_question'
    ,df.orderBy(rand(seed=0))['question']
).createOrReplaceTempView('with_random_questions')

The problem is that the code above does add the required column, but it does not permute the values in it.

What am I doing wrong, and how can I fix it?

Thanks,

Gilad

Answer 1

This should do the trick:

import pyspark.sql.functions as F

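# Take only the Question column, renamed to the target column name
questions = df.select(F.col('Question').alias('random_question'))
# Shuffle those values by sorting on a random number
random = questions.orderBy(F.rand())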

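Give both DataFrames a unique row ID. Note that monotonically_increasing_id() only guarantees unique, increasing IDs, not consecutive ones, so the two sets of IDs line up only when both DataFrames are partitioned the same way: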

df = df.withColumn('row_id', F.monotonically_increasing_id())
random = random.withColumn('row_id', F.monotonically_increasing_id())

Join them on the row ID:

final_df = df.join(random, 'row_id')

PySpark DataFrame: append a random permutation of a single column
