I want to add a new column to a PySpark DataFrame (df1) containing aggregated information from another DataFrame (df2).
df1.show()
+----------------+
| name |
+----------------+
| 000097 |
| 000097 |
| 000098 |
+----------------+
df2.show()
+----------------+----------------+
| name | id |
+----------------+----------------+
| 000097 | 1 |
| 000097 | 2 |
| 000098 | 1 |
| 000098 | 2 |
| 000098 | 3 |
+----------------+----------------+
This should result in
df1_new.show()
+----------------+----------------+
| name | id_set |
+----------------+----------------+
| 000097 | [1,2] |
| 000097 | [1,2] |
| 000098 | [1,2,3] |
+----------------+----------------+
I created a lookup:
from pyspark.sql.functions import collect_set

lookup_set = df1.join(df2, ['name'], "left").groupBy('name').agg(collect_set("id").alias("id_set"))
lookup_set.show()
+----------------+----------------+
| name | id_set |
+----------------+----------------+
| 000097 | [1,2] |
| 000098 | [1,2,3] |
+----------------+----------------+
But when I try to access the lookup:
lookup_set["name"].show()
or
lookup_set["id_set"].where(lookup_set["name"] == "000097")
I get the error:
TypeError: 'Column' object is not callable
What am I doing wrong here?
Answer (score: 2)
You are treating the Spark DataFrame like a Pandas DataFrame, which is what causes the error.
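To see where the exact message comes from: indexing a Spark DataFrame by column name returns a Column expression, and attribute access on a Column (such as .show) just builds another Column, which you then try to call. A quick sketch of this, using the lookup_set from above:

# Indexing a Spark DataFrame returns a Column expression, not a DataFrame
col = lookup_set["name"]
print(type(col))  # <class 'pyspark.sql.column.Column'>
# lookup_set["name"].show is itself a Column (nested-field access),
# so calling it raises: TypeError: 'Column' object is not callable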
If you want to display a single column, use select and pass the list of columns you want to view:
lookup_set["name"].show()
becomes
lookup_set.select("name").show()
lookup_set["id_set"].where(lookup_set["name"] == "000097")
should be
lookup_set.select("id_set").where(lookup_set["name"] == "000097").show()
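If you also want the df1_new from the question, with one row per row of df1 (duplicates preserved), a simple way is to join the lookup back onto df1. A minimal sketch, assuming the lookup_set defined above:

# Join the aggregated lookup back onto df1 so every original row,
# including duplicates, keeps its id_set
df1_new = df1.join(lookup_set, on="name", how="left")
df1_new.show()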