PySpark - Subtracting a column across two different DataFrames

Time: 2021-06-14 08:42:31

Tags: python apache-spark pyspark

I have the two DataFrames below:

df1:                                             df2:
  +------------+------------+-----------+        +-----------+-------------+-----------+    
  |  date      |Advertiser  |Impressions|        |   date    |Advertiser   |Impressions|
  +------------+------------+-----------+        +-----------+-------------+-----------+
  |2020-01-08  |b           |50035      |        | 2020-01-07|b            |10000      |
  |2020-01-08  |c           |70000      |        | 2020-01-07|c            |25260      |
  +------------+------------+-----------+        +-----------+-------------+-----------+ 
  

I want to compute df1(Impressions) - df2(Impressions) for each Advertiser and save the result to a new DataFrame, df3.

  +------------+------------+----------------+               
  |  date      |Advertiser  |diff Impressions|       
  +------------+------------+----------------+        
  |2020-01-08  |b           |40035           |        
  |2020-01-08  |c           |44740           |          
  +------------+------------+----------------+

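1 answer:

Answer 0 (score: 0)

You can join the two DataFrames on the Advertiser column and then select the columns you need, taking date from df1 and subtracting df2's Impressions from df1's: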

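# Inner join on Advertiser; each matched row carries both Impressions values
df3 = df1.join(df2, 'Advertiser').select(
    df1.date,      # keep the date from df1
    'Advertiser',  # join key, present once in the joined result
    (df1.Impressions - df2.Impressions).alias('diff Impressions')
)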

df3.show()
+----------+----------+----------------+
|      date|Advertiser|diff Impressions|
+----------+----------+----------------+
|2020-01-08|         b|           40035|
|2020-01-08|         c|           44740|
+----------+----------+----------------+