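I have a Spark data frame on which I am doing a stratified split. I followed the answer given in the SO post https://stackoverflow.com/a/50476540/5743766 to do the split.
When I run the split on my local machine, it works fine even as the dataset grows (up to 50k rows): it creates two datasets that share no rows and preserve the ratio of the target column.
However, when we deploy the code to the cluster, it behaves strangely for larger datasets. The total number of rows across the splits is correct (the row counts of the split datasets add up to the row count of the original dataset), but some rows (about 9k out of 50k) appear in both split datasets, while a similar number of other rows (~9k) are filtered out of the original dataset and appear in neither split.
The solution I found for this problem is to cache the dataset before doing the split, which removes the row duplication. Following are the physical plans of the split dataset with and without caching: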
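The symptom described above can be checked directly by comparing IDs across the splits. A minimal sketch, assuming `ID` uniquely identifies a row; the helper name `splitDiagnostics` and the toy data are hypothetical and not part of the original code:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SplitCheck {
  // Hypothetical helper: returns (IDs present in both splits, IDs of `original` present in neither).
  // Assumes the ID column uniquely identifies a row.
  def splitDiagnostics(original: DataFrame, a: DataFrame, b: DataFrame): (Long, Long) = {
    val idsA = a.select("ID")
    val idsB = b.select("ID")
    val duplicated = idsA.intersect(idsB).count()
    val missing = original.select("ID").except(idsA.union(idsB)).count()
    (duplicated, missing)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("split-check").getOrCreate()
    import spark.implicits._

    // Toy data: IDs 1..10 with a two-valued target column.
    val original = (1 to 10).map(i => (i, i % 2)).toDF("ID", "TARGET_COLUMN")
    // Deliberately faulty "splits": ID 4 lands in both, ID 10 lands in neither.
    val a = original.filter($"ID" <= 4)
    val b = original.filter($"ID" >= 4 && $"ID" <= 9)

    println(splitDiagnostics(original, a, b)) // prints (1,1)
    spark.stop()
  }
}
```

On a correct split both numbers should be 0. Note that equal total row counts do not rule out the problem, because every duplicated row can be offset by a missing one.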
With cache - split dataframe
== Physical Plan ==
*(1) Project [ID#0, TARGET_COLUMN#1]
+- *(1) Filter ((isnotnull(__split__ratio#661) && (__split__ratio#661 > 0.3)) && (__split__ratio#661 <= 1.0))
+- *(1) InMemoryTableScan [ID#0, TARGET_COLUMN#1, __split__ratio#661], [isnotnull(__split__ratio#661), (__split__ratio#661 > 0.3), (__split__ratio#661 <= 1.0)]
+- InMemoryRelation [ID#0, TARGET_COLUMN#1, __split__count#650L, __split__rowNumber#655, __split__ratio#661], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(4) Project [ID#0, TARGET_COLUMN#1, __split__count#650L, __split__rowNumber#655, (cast(__split__rowNumber#655 as double) / cast(CASE WHEN NOT (__split__count#650L = 0) THEN __split__count#650L ELSE 1 END as double)) AS __split__ratio#661]
+- Window [row_number() windowspecdefinition(TARGET_COLUMN#1, _w0#656 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __split__rowNumber#655], [TARGET_COLUMN#1], [_w0#656 ASC NULLS FIRST]
+- *(3) Sort [TARGET_COLUMN#1 ASC NULLS FIRST, _w0#656 ASC NULLS FIRST], false, 0
+- *(3) Project [ID#0, TARGET_COLUMN#1, __split__count#650L, rand(1618) AS _w0#656]
+- Window [count(TARGET_COLUMN#1) windowspecdefinition(TARGET_COLUMN#1, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __split__count#650L], [TARGET_COLUMN#1]
+- *(2) Sort [TARGET_COLUMN#1 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(TARGET_COLUMN#1, 200)
+- *(1) FileScan parquet [ID#0,TARGET_COLUMN#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://localhost:54310/test/serverPartition.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ID:int,TARGET_COLUMN:int>
Without cache - split dataframe
== Physical Plan ==
*(4) Project [ID#0, TARGET_COLUMN#1]
+- *(4) Filter ((isnotnull(__split__rowNumber#655) && ((cast(__split__rowNumber#655 as double) / cast(CASE WHEN NOT (__split__count#650L = 0) THEN __split__count#650L ELSE 1 END as double)) > 0.3)) && ((cast(__split__rowNumber#655 as double) / cast(CASE WHEN NOT (__split__count#650L = 0) THEN __split__count#650L ELSE 1 END as double)) <= 1.0))
+- Window [row_number() windowspecdefinition(TARGET_COLUMN#1, _w0#656 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS __split__rowNumber#655], [TARGET_COLUMN#1], [_w0#656 ASC NULLS FIRST]
+- *(3) Sort [TARGET_COLUMN#1 ASC NULLS FIRST, _w0#656 ASC NULLS FIRST], false, 0
+- *(3) Project [ID#0, TARGET_COLUMN#1, __split__count#650L, rand(1618) AS _w0#656]
+- Window [count(TARGET_COLUMN#1) windowspecdefinition(TARGET_COLUMN#1, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS __split__count#650L], [TARGET_COLUMN#1]
+- *(2) Sort [TARGET_COLUMN#1 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(TARGET_COLUMN#1, 200)
+- *(1) FileScan parquet [ID#0,TARGET_COLUMN#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://localhost:54310/test/serverPartition.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ID:int,TARGET_COLUMN:int>
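Plans like the two above can be printed with `explain()` on the split data frame. A minimal sketch on toy data; `splitDF` is a hypothetical name, and the printed plan will differ from the ones shown because it depends on the input source and the Spark version:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ShowPlan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("show-plan").getOrCreate()
    import spark.implicits._

    // Toy stand-in for one of the split data frames.
    val df = (1 to 10).map(i => (i, i % 2)).toDF("ID", "TARGET_COLUMN")
    val splitDF = df.filter(col("ID") > 3)

    splitDF.explain()     // physical plan only
    splitDF.explain(true) // parsed, analyzed, optimized, and physical plans
    spark.stop()
  }
}
```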
Code that performs the split:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, rand, row_number, when}
import org.apache.spark.storage.StorageLevel
import scala.util.Random

val targetColumn = "TARGET_COLUMN"
// Window covering every row that shares the same target value
val partitionWindow = Window.partitionBy(targetColumn).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
// orderBy needs a Column; the physical plans above show this ordering as rand(<seed>)
val windowWithRandomOrdering = Window.partitionBy(targetColumn).orderBy(rand(Random.nextLong()))

val resultDF = dataFrame
  .withColumn("__split__count", count(col(targetColumn)).over(partitionWindow)) // column which holds count of number of rows per partition
  .withColumn("__split__rowNumber", row_number().over(windowWithRandomOrdering)) // row number within partition
  .withColumn("__split__ratio", col("__split__rowNumber") / when(col("__split__count") =!= 0, col("__split__count")).otherwise(lit(1))) // relative position within partition, in (0, 1]

// Following line magically removed the duplication of the rows across splits
resultDF.persist(StorageLevel.MEMORY_AND_DISK)

val splitRatios = List(0.3, 0.7)
// Walk the ratios, keeping for each split the rows whose ratio falls in (prevRatio, prevRatio + ratio]
val (splits, _) = splitRatios.foldLeft((List.empty[DataFrame], 0.0)) {
  case ((dataFrames, prevRatio), ratio) =>
    val filteredDF = resultDF
      .filter(col("__split__ratio") > prevRatio && col("__split__ratio") <= prevRatio + ratio)
      .drop("__split__ratio")
      .drop("__split__count")
      .drop("__split__rowNumber")
    (dataFrames :+ filteredDF, prevRatio + ratio)
}
// write the dataframes
Can someone explain why, when the dataset is not cached, some rows are filtered out and others are duplicated, and how caching helps resolve this?
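Note: the issue does not occur on the cluster for smaller files of around 5k rows, and it never occurs on the local machine for either larger or smaller files.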
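One way to investigate, sketched under assumptions: run two separate actions on the non-cached frame and compare the row number assigned to each ID. A non-zero count would suggest that the random ordering is not reproduced identically from one action to the next. The helper name `unstableRowCount` is hypothetical, `ID` is assumed to be a unique integer, and the toy data below will not necessarily reproduce the cluster behaviour:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rand, row_number}

object StabilityProbe {
  // Hypothetical helper: evaluates `df` twice and counts the IDs whose
  // __split__rowNumber differs between the two evaluations.
  // Collects to the driver, so it is only suitable for modest row counts.
  def unstableRowCount(df: DataFrame): Int = {
    def snapshot(): Map[Int, Int] =
      df.select("ID", "__split__rowNumber")
        .collect()
        .map(r => r.getInt(0) -> r.getInt(1))
        .toMap
    val first = snapshot()
    val second = snapshot()
    first.count { case (id, rowNumber) => !second.get(id).contains(rowNumber) }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("stability-probe").getOrCreate()
    import spark.implicits._

    // Toy data; on the cluster this would be the real, non-persisted frame.
    val df = (1 to 100).map(i => (i, i % 2)).toDF("ID", "TARGET_COLUMN")
    val randomOrder = Window.partitionBy("TARGET_COLUMN").orderBy(rand(42L))
    val numbered = df.withColumn("__split__rowNumber", row_number().over(randomOrder))

    println(s"IDs whose row number changed between runs: ${unstableRowCount(numbered)}")
    spark.stop()
  }
}
```

Running the same probe on the persisted frame gives a point of comparison: if the count drops to 0 only with `persist`, the splits in the non-cached case are probably being filtered from differently ordered evaluations of the same plan.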