Question

Is it possible to split a DF into two parts using a single filter operation? For example,

let's say df has the records below:

UID    Col
 1       a
 2       b
 3       c

If I do

df1 = df.filter(UID <=> 2)

can I save the filtered and non-filtered records in different RDDs in a single operation?

 df1 can have records where uid = 2
 df2 can have records with uid 1 and 3

Answer 1

If you're only interested in saving the data, you can add an indicator column to the DataFrame:

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
val dfWithInd = df.withColumn("ind", $"uid" <=> 2)

and use it as a partition column for the DataFrameWriter with one of the supported formats (as of 1.6 these are Parquet, text, and JSON):

dfWithInd.write.partitionBy("ind").parquet(...)

This creates two separate directories (ind=false, ind=true) on write.
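To make the directory layout concrete, here is a minimal sketch of the write followed by reading each half back from its own directory. It assumes Spark 2.x or later (SparkSession; on 1.6 you would go through SQLContext instead). The object name, the helper writeAndReadBack, and the output path are hypothetical, made up for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionedWriteExample {
  // Hypothetical helper: writes the sample data partitioned by "ind" under
  // outputPath, then reads each partition directory back separately.
  // Returns (rows where uid = 2, all other rows).
  def writeAndReadBack(spark: SparkSession, outputPath: String): (DataFrame, DataFrame) = {
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
    val dfWithInd = df.withColumn("ind", $"uid" <=> 2)

    // A single write job produces outputPath/ind=true and outputPath/ind=false.
    dfWithInd.write.mode("overwrite").partitionBy("ind").parquet(outputPath)

    // Reading a partition directory directly yields only that half;
    // the "ind" column lives in the directory name, not in the files.
    val matched = spark.read.parquet(s"$outputPath/ind=true")
    val rest = spark.read.parquet(s"$outputPath/ind=false")
    (matched, rest)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for demonstration
      .appName("partitioned-write-example")
      .getOrCreate()

    // Hypothetical output location.
    val (matched, rest) = writeAndReadBack(spark, "/tmp/df_split_example")
    matched.show()
    rest.show()

    spark.stop()
  }
}
```

Note that the split happens on disk only: the two DataFrames read back afterwards are new, independent reads, not outputs of the original transformation.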

In general, though, it is not possible for a single transformation to yield multiple RDDs or DataFrames. See How to split a RDD into two or more RDDs?
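If you do need both halves as DataFrames, the usual workaround is simply two filters over the same (ideally cached) input, one with the predicate and one with its negation. A minimal sketch, assuming Spark 2.x or later; the object name and the helper splitByUid are hypothetical and not part of any Spark API:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SplitExample {
  // Hypothetical helper: returns (rows matching uid, remaining rows).
  // This is two transformations over the same input, not one.
  def splitByUid(df: DataFrame, uid: Int): (DataFrame, DataFrame) = {
    // <=> is null-safe equality, so it never evaluates to null and the
    // negated filter also keeps rows whose uid is null.
    val isMatch = col("uid") <=> uid
    (df.filter(isMatch), df.filter(!isMatch))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for demonstration
      .appName("split-example")
      .getOrCreate()
    import spark.implicits._

    // Cache so the second filter can reuse the data instead of recomputing it.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col").cache()

    val (df1, df2) = splitByUid(df, 2)
    df1.show() // rows with uid = 2
    df2.show() // rows with uid 1 and 3

    spark.stop()
  }
}
```

Using <=> rather than === matters here: with ===, a row with a null uid would evaluate to null under both the predicate and its negation and would end up in neither half.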