Python Spark: how to join 2 datasets whose tuples contain more than 2 elements

Time: 2017-12-14 22:19:47

Tags: apache-spark pyspark

I am trying to join the data from these two datasets on their common "stock" key:

stock, sector
GOOG Tech

stock, date, volume
GOOG 2015 5759725

The join method should combine them, but the RDD I get back has this shape, with the volume missing:

GOOG, (Tech, 2015)

What I am trying to get is:

(Tech, 2015) 5759725

Also, how can I reduce the result by key (for example (Tech, 2015)) to get the total volume for each sector and year?

Many thanks in advance!

1 Answer:

Answer 0 (score: 0)

Hope this helps!

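from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum

# `sc` is predefined in the pyspark shell; in a standalone script create it first
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# sample data
df1 = sc.parallelize([['GOOG', 'Tech'],
                      ['AAPL', 'Tech'],
                      ['XOM', 'Oil']]).toDF(["stock","sector"])
df2 = sc.parallelize([['GOOG', '2015', '5759725'],
                      ['AAPL', '2015', '123'],
                      ['XOM',  '2015', '234'],
                      ['XOM',  '2016', '789']]).toDF(["stock","date","volume"])

# join on stock, then pack sector and date into one struct column to use as the key
df = df1.join(df2, ['stock'], 'inner').\
    withColumn('sector_year', F.struct(F.col('sector'), F.col('date'))).\
    drop('stock','sector','date')
df.show()

# sum of volume for each sector and year
# (volume is stored as a string here, so Spark casts it to double when summing)
df.groupBy('sector_year').agg(F.sum('volume')).show()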

The output is:

+-------+-----------+
| volume|sector_year|
+-------+-----------+
|    123|[Tech,2015]|
|    234| [Oil,2015]|
|    789| [Oil,2016]|
|5759725|[Tech,2015]|
+-------+-----------+

+-----------+-----------+
|sector_year|sum(volume)|
+-----------+-----------+
|[Tech,2015]|  5759848.0|
| [Oil,2015]|      234.0|
| [Oil,2016]|      789.0|
+-----------+-----------+