Intersection and union of two pyspark dataframes on the basis of a common column

Asked: 2018-05-21 07:54:39

Tags: python apache-spark pyspark

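I have two pyspark dataframes, A and B.

A has two columns, date and symbol; B has two columns, date2 and entity.

I just want to get the union and the intersection of these two dataframes, grouped by date.

For example, if df A is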

+----------+------+
|      date|symbol|
+----------+------+
|2013-08-30|     A|
|2013-08-30|   AAL|
|2013-08-30|   AAP|
|2013-08-30|  AAPL|
|2013-08-30|  ABBV|
+----------+------+

and B is:

+----------+-------------+
|       day|entity_ticker|
+----------+-------------+
|2013-08-30|            A|
|2013-08-30|          AAL|
|2013-08-30|          AAP|
|2013-08-30|         AAPL|
|2013-08-30|          ABC|
+----------+-------------+

I want the union:

+----------+-------------------------+
|dd        |union_of_symbols         |
+----------+-------------------------+
|2013-08-30|[A,AAL,AAP,AAPL,ABBV,ABC]|
+----------+-------------------------+

and the intersection:

+----------+-----------------------+
|dd        |intersection_of_symbols|
+----------+-----------------------+
|2013-08-30|[A,AAL,AAP,AAPL]       |
+----------+-----------------------+

Thanks in advance.

1 answer:

Answer 0: (score: 1)

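You can use the union and intersect functions of the dataframe API. After the union or intersect, the final step is a groupBy with the collect_set built-in function as the aggregation. Both operations match columns by position rather than by name, so the result keeps A's column names (date and symbol) even though B's columns are named differently.

For the union: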
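To make the semantics concrete, here is a minimal plain-Python model of what the two pipelines compute. It does not use Spark at all: the rows are ordinary tuples copied from the question, and the helper name collect_set_by_date is made up for this illustration.

```python
from collections import defaultdict

# Rows as plain tuples; B's columns have different names but sit in the same positions.
A_rows = [("2013-08-30", s) for s in ["A", "AAL", "AAP", "AAPL", "ABBV"]]
B_rows = [("2013-08-30", s) for s in ["A", "AAL", "AAP", "AAPL", "ABC"]]


def collect_set_by_date(rows):
    # Models groupBy(date).agg(collect_set(symbol)): one set of symbols per date.
    grouped = defaultdict(set)
    for date, symbol in rows:
        grouped[date].add(symbol)
    return dict(grouped)


# union keeps every row from both inputs; duplicates disappear once the set is built.
union_result = collect_set_by_date(A_rows + B_rows)

# intersect keeps only the distinct rows that appear in both inputs.
intersect_result = collect_set_by_date(set(A_rows) & set(B_rows))

print(sorted(union_result["2013-08-30"]))
print(sorted(intersect_result["2013-08-30"]))
```

Sets carry no order, which is why the model sorts before printing; collect_set likewise makes no promise about element order.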

from pyspark.sql import functions as f
#union of two dataframes
A.union(B).groupBy(f.col('date').alias('dd')).agg(f.collect_set('symbol').alias('union_of_symbols')).show(truncate=False)

which should give you

+----------+------------------------------+
|dd        |union_of_symbols              |
+----------+------------------------------+
|2013-08-30|[AAL, AAP, ABC, A, AAPL, ABBV]|
+----------+------------------------------+

For the intersection:

#intersection of two dataframes
A.intersect(B).groupBy(f.col('date').alias('dd')).agg(f.collect_set('symbol').alias('intersection_of_symbols')).show(truncate=False)

which should give you

+----------+-----------------------+
|dd        |intersection_of_symbols|
+----------+-----------------------+
|2013-08-30|[AAL, AAP, A, AAPL]    |
+----------+-----------------------+