PySpark: using groupBy as a lookup - TypeError: 'Column' object is not callable

Date: 2019-07-13 11:33:57

Tags: dataframe group-by pyspark aggregate-functions

I want to add a new column to a PySpark DataFrame (df1) that contains aggregated information from another DataFrame (df2).

df1.show()

+----------------+
|   name         |
+----------------+
|     000097     |
|     000097     |
|     000098     |
+----------------+

df2.show()

+----------------+----------------+
|   name         |    id          |
+----------------+----------------+
|     000097     |     1          |
|     000097     |     2          |
|     000098     |     1          |
|     000098     |     2          |
|     000098     |     3          |
+----------------+----------------+

This should result in:

df1_new.show()

+----------------+----------------+
|   name         |    id_set      |
+----------------+----------------+
|     000097     |     [1,2]      |
|     000097     |     [1,2]      |
|     000098     |     [1,2,3]    |
+----------------+----------------+

I created a lookup:

from pyspark.sql.functions import collect_set

lookup_set = df1.join(df2, ['name'], "left").groupBy('name').agg(collect_set("id").alias("id_set"))

lookup_set.show()

+----------------+----------------+
|   name         |    id_set      |
+----------------+----------------+
|     000097     |     [1,2]      |
|     000098     |     [1,2,3]    |
+----------------+----------------+
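Conceptually, the join + groupBy + collect_set pipeline builds a mapping from each name to the distinct set of ids it appears with. A plain-Python sketch of that aggregation, using the example data above (an analogy, not Spark code):

```python
# Rows of df2 from the example above
df2_rows = [("000097", 1), ("000097", 2),
            ("000098", 1), ("000098", 2), ("000098", 3)]

lookup = {}
for name, id_ in df2_rows:
    # collect_set gathers the distinct ids seen per name
    lookup.setdefault(name, set()).add(id_)

print(lookup)  # {'000097': {1, 2}, '000098': {1, 2, 3}}
```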

But when I try to access the lookup:

lookup_set["name"].show()

lookup_set["id_set"].where(lookup_set["name"] == "000097")

I get the error:

TypeError: 'Column' object is not callable

What am I doing wrong here?

1 answer:

Answer 0 (score: 2)

You are treating a Spark DataFrame like a pandas DataFrame, which is what causes the error. Indexing with `lookup_set["name"]` returns a `Column` expression, not a sub-DataFrame, and a `Column` has no `show` method.

To display a single column, use select and pass the list of columns you want to see:

lookup_set["name"].show() becomes lookup_set.select("name").show()

lookup_set["id_set"].where(lookup_set["name"] == "000097")

should be

lookup_set.select("id_set").where(lookup_set["name"] == "000097").show()
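To get the df1_new from the question, you would join lookup_set back onto df1 on 'name' (e.g. df1.join(lookup_set, ['name'], 'left')). In plain-Python terms, that final lookup step amounts to the following sketch (hypothetical data mirroring the example):

```python
df1_rows = ["000097", "000097", "000098"]
lookup = {"000097": [1, 2], "000098": [1, 2, 3]}  # from lookup_set above

# Each df1 row picks up the aggregated id_set for its name,
# which is what a left join on 'name' produces.
df1_new = [(name, lookup.get(name)) for name in df1_rows]
print(df1_new)
# [('000097', [1, 2]), ('000097', [1, 2]), ('000098', [1, 2, 3])]
```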