I am trying to join the data from these two datasets based on the common "stock" key:
stock, sector
GOOG Tech
stock, date, volume
GOOG 2015 5759725
The join method should connect them, but the RDD I get has the format:
GOOG, (Tech, 2015)
What I am trying to get is:
(Tech, 2015) 5759725
Also, how can I reduce the result by key (e.g. (Tech, 2015)) to get the numerical sum for each sector and year?
Many thanks in advance!
Answer 0 (score: 0)
Hope this helps!
from pyspark.sql.functions import struct, col, sum

#sample data
df1 = sc.parallelize([['GOOG', 'Tech'],
                      ['AAPL', 'Tech'],
                      ['XOM', 'Oil']]).toDF(["stock", "sector"])
df2 = sc.parallelize([['GOOG', '2015', '5759725'],
                      ['AAPL', '2015', '123'],
                      ['XOM', '2015', '234'],
                      ['XOM', '2016', '789']]).toDF(["stock", "date", "volume"])

#final output
df = df1.join(df2, ['stock'], 'inner').\
    withColumn('sector_year', struct(col('sector'), col('date'))).\
    drop('stock', 'sector', 'date')
df.show()

#numerical summation for each sector and year
df.groupBy('sector_year').agg(sum('volume')).show()
The output is:
+-------+-----------+
| volume|sector_year|
+-------+-----------+
| 123|[Tech,2015]|
| 234| [Oil,2015]|
| 789| [Oil,2016]|
|5759725|[Tech,2015]|
+-------+-----------+
+-----------+-----------+
|sector_year|sum(volume)|
+-----------+-----------+
|[Tech,2015]| 5759848.0|
| [Oil,2015]| 234.0|
| [Oil,2016]| 789.0|
+-----------+-----------+
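
Since the question works with plain RDDs rather than DataFrames, here is a minimal RDD-based sketch of the same logic using join and reduceByKey; the names rdd1 and rdd2 and the sample values are illustrative, not taken from the answer above:

# a minimal RDD-based sketch, assuming two pair RDDs keyed by stock;
# rdd1/rdd2 and their contents are illustrative
rdd1 = sc.parallelize([('GOOG', 'Tech'),
                       ('AAPL', 'Tech'),
                       ('XOM', 'Oil')])
rdd2 = sc.parallelize([('GOOG', ('2015', 5759725)),
                       ('AAPL', ('2015', 123)),
                       ('XOM', ('2015', 234)),
                       ('XOM', ('2016', 789))])

# join on the common key yields (stock, (sector, (date, volume)))
joined = rdd1.join(rdd2)

# re-key each record as ((sector, date), volume)
by_sector_year = joined.map(lambda kv: ((kv[1][0], kv[1][1][0]), kv[1][1][1]))

# sum the volumes per (sector, date) key
totals = by_sector_year.reduceByKey(lambda a, b: a + b)
print(totals.collect())
# e.g. [(('Tech', '2015'), 5759848), (('Oil', '2015'), 234), (('Oil', '2016'), 789)]

The map step restructures each joined record from (stock, (sector, (date, volume))) into the ((sector, date), volume) shape asked for in the question, and reduceByKey then produces the per-sector, per-year totals.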