Question

Suppose we have a DataFrame like this:

col1 | col2 | col3
A    |  B   |  C
D    |  F   |  C
G    |  H   |  I
Z    |  X   |  V
Q    |  R   |  V

Now suppose, for example, that I need to split the DataFrame by the unique values of col3, so that I get DataFrame 1:

col1 | col2 | col3
A    |  B   |  C
D    |  F   |  C

DataFrame 2 like this:

col1 | col2 | col3
G    |  H   |  I

DataFrame 3 like this:

col1 | col2 | col3
Z    |  X   |  V
Q    |  R   |  V

Right now I have a list of these unique col3 values, and I loop over it to filter the DataFrame, like this (pseudocode):

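// Sketch in Scala-like syntax; `dataframe` and `process` are defined elsewhere.
val values = Seq("C", "I", "V")
for (value <- values) {
  // Each iteration filters the full DataFrame again.
  val processDF = dataframe.filter(dataframe("col3") === value)
  process(processDF)
}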

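This does not seem like a good approach: I scan the initial DataFrame, filter it by the first value and process the result, then go back and scan the initial DataFrame again for the second element of the list, process that result, and so on. Is there another way to create all of these DataFrames in a single scan, and so improve performance?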

Answer 1

Since you asked about doing this with a DataFrame:

Update:

scala> case class test(col1: String, col2: String, col3: String)
defined class test

scala> val listoftest = List(test("G", "H", "I"), test("A", "B", "C"), 
                             test("D", "F", "C"), test("Z", "X", "V"), 
                             test("Q", "R", "V"))
listoftest: List[test] = List(test(G,H,I), test(A,B,C), test(D,F,C), 
                              test(Z,X,V), test(Q,R,V))

scala> print(listoftest)
List(test(G,H,I), test(A,B,C), test(D,F,C), test(Z,X,V), test(Q,R,V))


scala> val df = sc.parallelize(listoftest).toDF
scala> df.rdd.groupBy(r => r.get(2))
res12: org.apache.spark.rdd.RDD[(Any, Iterable[org.apache.spark.sql.Row])]
= ShuffledRDD[13] at groupBy at <console>:31

scala> res12.collect.foreach(x => println(x))
(I,CompactBuffer([G,H,I]))
(V,CompactBuffer([Z,X,V], [Q,R,V]))
(C,CompactBuffer([A,B,C], [D,F,C]))
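
The grouped RDD above can also be consumed without collecting every row to the driver. The following is a minimal sketch, assuming Spark 2.x or later (it uses `SparkSession` rather than the `sc` from the shell session above). `SplitByCol3`, `processGroups`, and `process` are hypothetical names, and `process` merely counts rows as a stand-in for the real per-group work.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object SplitByCol3 {
  // Hypothetical stand-in for the per-group work; here it only counts the rows.
  def process(key: String, rows: Iterable[Row]): Int = rows.size

  // Groups the rows by col3 with one groupBy and applies `process` to each group.
  def processGroups(spark: SparkSession): Map[String, Int] = {
    import spark.implicits._

    // Same sample data as in the question.
    val df = Seq(
      ("A", "B", "C"), ("D", "F", "C"), ("G", "H", "I"),
      ("Z", "X", "V"), ("Q", "R", "V")
    ).toDF("col1", "col2", "col3")

    df.rdd
      .groupBy(row => row.getString(2))                       // key = value of col3
      .map { case (key, rows) => key -> process(key, rows) }  // runs on the executors
      .collect()                                              // only the small per-key results reach the driver
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to make the sketch runnable on one machine.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-by-col3")
      .getOrCreate()
    try processGroups(spark).foreach(println)
    finally spark.stop()
  }
}
```

Note that each group arrives as an `Iterable[Row]` rather than as a DataFrame, and `groupBy` needs to hold all rows for a given key in memory at once, so this probably fits best when no single col3 value dominates the data.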
