PySpark: filter using a column from a different DataFrame

Time: 2019-06-26 15:11:15

Tags: filter pyspark

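I want to keep only the rows of the price DataFrame whose Id also exists in the events DataFrame. My code is below, but it does not work in PySpark. How can I fix it?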

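from pyspark.sql import SparkSession

# Reuse the active session (for example in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()

events = spark.createDataFrame([(657, 'Conferences'),
                                (765, 'Seminars'),
                                (776, 'Meetings'),
                                (879, 'Conferences'),
                                (765, 'Meetings'),
                                (879, 'Seminars'),
                                (985, 'Meetings'),
                                (879, 'Meetings'),
                                (657, 'Seminars'),
                                (657, 'Conferences')],
                               ['Id', 'event_name'])
events.show()

price = spark.createDataFrame([(657, 10),
                               (879, 45),
                               (776, 54),
                               (879, 45),
                               (765, 65)],
                              ['Id', 'Price'])

# This is the line that does not work: isin() is given a column of another DataFrame
price[price.Id.isin(events.Id)].show()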

1 Answer:

Answer 0: (score: 0)

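A simple inner join on Id returns prices only for the Ids that exist in the events table. Because an Id can appear several times in events, distinct() is used to drop the repeated rows the join produces: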

events.join(price, "Id").select("Id", "Price").distinct().show()