Is there any way to shuffle a column of an RDD or DataFrame so that the entries in that column appear in random order? I'm not sure which APIs I could use to accomplish this.
Answer 0 (score: 3)
How about selecting the column you want to shuffle, ordering it with orderBy(rand), and zipping it by index back onto the existing DataFrame?
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.functions.{col, rand}

def addIndex(df: DataFrame) = spark.createDataFrame(
  // Add index
  df.rdd.zipWithIndex.map { case (r, i) => Row.fromSeq(r.toSeq :+ i) },
  // Create schema
  StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
case class Entry(name: String, salary: Double)
val r1 = Entry("Max", 2001.21)
val r2 = Entry("Zhang", 3111.32)
val r3 = Entry("Bob", 1919.21)
val r4 = Entry("Paul", 3001.5)
val df = addIndex(spark.createDataFrame(Seq(r1, r2, r3, r4)))
val df_shuffled = addIndex(df
  .select(col("salary").as("salary_shuffled"))
  .orderBy(rand))

df.join(df_shuffled, Seq("_index"))
  .drop("_index")
  .show(false)
+-----+-------+---------------+
|name |salary |salary_shuffled|
+-----+-------+---------------+
|Max |2001.21|3001.5 |
|Zhang|3111.32|3111.32 |
|Paul |3001.5 |2001.21 |
|Bob |1919.21|1919.21 |
+-----+-------+---------------+
Answer 1 (score: 2)
While a single column can't be shuffled directly, the records in an RDD can be permuted via RandomRDDs: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/random/RandomRDDs.html

A potential approach to permuting just one column might be (a rough sketch follows the list):

- use mapPartitions to do some setup/teardown on each worker task
- pull the partition's records into memory, i.e. iterator.toList; make sure you have many (/small) data partitions to avoid OOME
- shuffle the target column's values within that in-memory list, leaving the other columns untouched
- return the result as list.toIterator from the mapPartitions
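A rough sketch of that per-partition idea, assuming a DataFrame and the positional index of the target column (the function name and parameters are made up for illustration):

import scala.util.Random
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Shuffle the values of a single column, partition by partition
def shuffleColumnWithinPartitions(df: DataFrame, colIdx: Int): RDD[Row] =
  df.rdd.mapPartitions { iterator =>
    // pull the partition into memory (keep partitions small to avoid OOME)
    val rows = iterator.toList
    // permute only the target column's values
    val shuffledValues = Random.shuffle(rows.map(_.get(colIdx)))
    // rewrite each row unchanged except for the shuffled column
    rows.zip(shuffledValues).map { case (row, value) =>
      Row.fromSeq(row.toSeq.updated(colIdx, value))
    }.iterator
  }

Note that this only permutes values within each partition, not globally across the whole dataset.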
Answer 2 (score: 2)
You can add an extra, randomly generated column and then sort the records by that column. This way you shuffle your target column randomly.

With this approach you don't need to hold all the data in memory, which can easily cause OOM; Spark handles the sorting and memory limits by spilling to disk if necessary.

If you don't want the extra column, you can drop it after sorting.
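A minimal sketch of this idea, reusing the salary column from the first answer purely for illustration:

import org.apache.spark.sql.functions.{col, rand}

// attach a random key, sort by it, then drop the helper column again
val shuffledSalary = df
  .select(col("salary").as("salary_shuffled"))
  .withColumn("_rand", rand())   // one random number per row
  .orderBy("_rand")              // the sort can spill to disk if needed
  .drop("_rand")

To put the shuffled values back next to the original rows you would still need something like the index join shown in Answer 0.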
Answer 3 (score: 2)
If you don't need a global shuffle across your whole data, you can shuffle within partitions using the mapPartitions method:

import scala.util.Random

rdd.mapPartitions(Random.shuffle(_))
For a PairRDD (an RDD of type RDD[(K, V)]), if you want to shuffle the key-to-value mapping (map arbitrary keys to arbitrary values):

pairRDD.mapPartitions(iterator => {
  val (keySequence, valueSequence) = iterator.toSeq.unzip
  val shuffledValueSequence = Random.shuffle(valueSequence)
  keySequence.zip(shuffledValueSequence).toIterator
}, true)
The boolean flag at the end indicates that partitioning is preserved by this operation (the keys are not changed), so that downstream operations such as reduceByKey can be optimized (avoid a shuffle).
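For example (a hypothetical Int-valued pair RDD, just to illustrate the effect of the flag):

import scala.util.Random
import org.apache.spark.HashPartitioner

// hypothetical (String, Int) pair RDD, already hash-partitioned by key
val pairRDD = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
  .partitionBy(new HashPartitioner(2))

val shuffledValues = pairRDD.mapPartitions(iterator => {
  val (keySequence, valueSequence) = iterator.toSeq.unzip
  keySequence.zip(Random.shuffle(valueSequence)).toIterator
}, preservesPartitioning = true)

// keys and the partitioner are unchanged, so this reduceByKey does not
// have to shuffle the data across the cluster again
shuffledValues.reduceByKey(_ + _).collect()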
Answer 4 (score: 1)
In case anyone is looking for a PySpark equivalent of Sascha Vetter's post, you can find it below:
import numpy as np

from pyspark.sql import functions as F
from pyspark.sql import Row
from pyspark.sql.types import *

def add_index_to_row(row, index):
    row_dict = row.asDict()
    row_dict["index"] = index
    return Row(**row_dict)

def add_index_to_df(df):
    df_with_index = df.rdd.zipWithIndex().map(lambda x: add_index_to_row(x[0], x[1]))
    new_schema = StructType(df.schema.fields + [StructField("index", LongType(), True)])
    return spark.createDataFrame(df_with_index, new_schema)
def shuffle_single_column(df, column_name):
    df_cols = df.columns
    # select the desired column and shuffle it (i.e. order it by a column of random numbers)
    shuffled_col = df.select(column_name).orderBy(F.rand())
    # add an explicit index to the shuffled column
    shuffled_col_index = add_index_to_df(shuffled_col)
    # add an explicit index to the original dataframe
    df_index = add_index_to_df(df)
    # drop the desired column from df, join it with the shuffled column on the created index and finally drop the index column
    df_shuffled = df_index.drop(column_name).join(shuffled_col_index, "index").drop("index")
    # reorder columns so that the shuffled column comes back to its initial position instead of the last position
    df_shuffled = df_shuffled.select(df_cols)
    return df_shuffled
# initialize random array
z = np.random.randint(20, size=(10, 3)).tolist()
# create the pyspark dataframe
example_df = sc.parallelize(z).toDF(("a","b","c"))
# shuffle one column of the dataframe
example_df_shuffled = shuffle_single_column(df = example_df, column_name = "a")