I have two pyspark dataframes, A & B.
A has two columns: date, symbol.
B has two columns: date2, entity.
I just want to get the union
and the intersection of these two dfs based on the date.
例如,如果df A为
+----------+------+
| date|symbol|
+----------+------+
|2013-08-30| A|
|2013-08-30| AAL|
|2013-08-30| AAP|
|2013-08-30| AAPL|
|2013-08-30| ABBV|
+----------+------+
and B as:
+----------+-------------+
| day|entity_ticker|
+----------+-------------+
|2013-08-30| A|
|2013-08-30| AAL|
|2013-08-30| AAP|
|2013-08-30| AAPL|
|2013-08-30| ABC|
+----------+-------------+
I just want the union:
+----------+--------------------------------+
| dd |union_of_symbols |
+----------+--------------------------------+
|2013-08-30| [A,AAL,AAP,AAPL,ABBV,ABC]|
+----------+--------------------------------+
and the intersection:
+----------+--------------------------------+
| dd |intersection_of_symbols |
+----------+--------------------------------+
|2013-08-30| [A,AAL,AAP,AAPL] |
+----------+--------------------------------+
Thanks in advance
Answer 0 (score: 1)
You can make use of the dataframe union
and intersect
functions. After the union
or intersect
, the final step is to groupBy
and use the collect_set
built-in function as the aggregation.
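The per-date set semantics that groupBy plus collect_set produce can be sketched in plain Python first, using the sample rows from the question (no Spark needed for the illustration):

```python
from collections import defaultdict

# Sample rows mirroring dataframes A and B from the question.
rows_a = [("2013-08-30", s) for s in ["A", "AAL", "AAP", "AAPL", "ABBV"]]
rows_b = [("2013-08-30", s) for s in ["A", "AAL", "AAP", "AAPL", "ABC"]]

def collect_sets(rows):
    """Group symbols into a set per date, like groupBy + collect_set."""
    out = defaultdict(set)
    for date, symbol in rows:
        out[date].add(symbol)
    return out

sets_a = collect_sets(rows_a)
sets_b = collect_sets(rows_b)

# Union and intersection of the symbol sets, per date.
union = {d: sets_a[d] | sets_b.get(d, set()) for d in sets_a}
inter = {d: sets_a[d] & sets_b.get(d, set()) for d in sets_a}

print(sorted(union["2013-08-30"]))  # ['A', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC']
print(sorted(inter["2013-08-30"]))  # ['A', 'AAL', 'AAP', 'AAPL']
```

This is only the set logic; the Spark versions below do the same thing distributed over the rows.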
For the union:
from pyspark.sql import functions as f
#union of two dataframes
A.union(B).groupBy(f.col('date').alias('dd')).agg(f.collect_set('symbol').alias('union_of_symbols')).show(truncate=False)
should give you
+----------+------------------------------+
|dd |union_of_symbols |
+----------+------------------------------+
|2013-08-30|[AAL, AAP, ABC, A, AAPL, ABBV]|
+----------+------------------------------+
For the intersection:
#intersection of two dataframes
A.intersect(B).groupBy(f.col('date').alias('dd')).agg(f.collect_set('symbol').alias('intersection_of_symbols')).show(truncate=False)
should give you
+----------+-----------------------+
|dd |intersection_of_symbols|
+----------+-----------------------+
|2013-08-30|[AAL, AAP, A, AAPL] |
+----------+-----------------------+