Pyspark: filter by a column of array type

Date: 2020-06-03 17:36:33

Tags: pyspark apache-spark-sql pyspark-dataframes

I have a dataframe like this:

import pandas as pd

df = pd.DataFrame({
    "Date": ["2020-05-10", "2020-05-10", "2020-05-10",
             "2020-05-11", "2020-05-11", "2020-05-12"],
    "Mode": ["A", "B", "A", "C", "C", "B"],
    "set(Mode)": [["A", "B"], ["A", "B"], ["A", "B"], ["C"], ["C"], ["B"]],
})

df = spark.createDataFrame(df)

+----------+----+---------+
|      Date|Mode|set(Mode)|
+----------+----+---------+
|2020-05-10|   A|   [A, B]|
|2020-05-10|   B|   [A, B]|
|2020-05-10|   A|   [A, B]|
|2020-05-11|   C|      [C]|
|2020-05-11|   C|      [C]|
|2020-05-12|   B|      [B]|
+----------+----+---------+

I want to filter on the column set(Mode) to get a dataframe like this:

+----------+----+---------+
|      Date|Mode|set(Mode)|
+----------+----+---------+
|2020-05-10|   A|   [A, B]|
|2020-05-10|   B|   [A, B]|
|2020-05-10|   A|   [A, B]|
+----------+----+---------+

But when I try to filter like this:

df.filter(F.col('set(Mode)') == ['A', 'B'])

I get the following error:

An error occurred while calling o1789.equalTo.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [A, B]

0 Answers:

No answers yet