Suppose I have a PySpark DataFrame with a set-type column:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# A SparkSession (not just a bare SparkContext) is required,
# otherwise RDD.toDF()/createDataFrame() is unavailable.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1, 'A'),
    (1, 'B'),
    (2, 'A'),
    (2, 'C'),
], ['id', 'val'])

df_grp = df.groupBy('id').agg(f.collect_set('val').alias('val_set'))
df_grp.show()
The output of show() is:
+---+-------+
| id|val_set|
+---+-------+
| 1| [B, A]|
| 2| [C, A]|
+---+-------+
How can I select only the rows where val_set is [B, A]?
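
For reference, one approach that should work (a sketch, not a definitive answer): collect_set actually returns an ArrayType column whose element order is non-deterministic, so comparing directly against [B, A] is fragile. Sorting both sides first makes the comparison order-insensitive. The target list ['A', 'B'] below is just this question's example value, written in sorted order.

# Sketch: order-insensitive filter on the collected set.
# sort_array() normalizes the non-deterministic element order
# produced by collect_set(), so equality compares set contents.
target = ['A', 'B']  # the desired set, in sorted order

matches = df_grp.filter(
    f.sort_array(f.col('val_set')) == f.array(*[f.lit(v) for v in target])
)
matches.show()

With the example data above, this should keep only the row for id = 1, regardless of whether the set was collected as [B, A] or [A, B].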