How can I add the columns of one file to another file using Spark RDDs?
File1 contains the following input:
Priority,qty,sales
Low,6,261.54
High,44,10123.02
High,27,244.57
High,30,4965.75
Null,22,394.27
File2 contains the following input:
priority,grade
Low,A
High,B
Null,K
My desired output should be:
Priority,qty,sales,grade
Low,6,261.54,A
High,44,10123.02,B
High,27,244.57,B
High,30,4965.75,B
Null,22,394.27,K
Answer 0 (score: 0)
RDD solution: Here is an RDD solution, using a left outer join.
# Key each record by its priority: (priority, (qty, sales))
rdd = sc.parallelize([('Low',6,261.54),('High',44,10123.02),('High',27,244.57),
                      ('High',30,4965.75),('Null',22,394.27)]).map(lambda x: (x[0], (x[1], x[2])))
rdd.collect()
[('Low', (6, 261.54)),
('High', (44, 10123.02)),
('High', (27, 244.57)),
('High', (30, 4965.75)),
('Null', (22, 394.27))]
# Lookup table of (priority, grade) pairs
rdd1 = sc.parallelize([('Low','A'),('High','B'),('Null','K')])
rdd1.collect()
[('Low', 'A'), ('High', 'B'), ('Null', 'K')]
# leftOuterJoin yields (priority, ((qty, sales), grade)); flatten it to (priority, qty, sales, grade)
rdd2 = rdd.leftOuterJoin(rdd1).map(lambda x: (x[0], x[1][0][0], x[1][0][1], x[1][1]))
rdd2.collect()
[('High', 27, 244.57, 'B'),
('High', 30, 4965.75, 'B'),
('High', 44, 10123.02, 'B'),
('Low', 6, 261.54, 'A'),
('Null', 22, 394.27, 'K')]
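Note that the join does not preserve the original row order. If you then want to persist the joined RDD back to disk as CSV text, a minimal sketch (the output path is hypothetical):

# Turn each tuple back into a CSV line and save as text files
rdd2.map(lambda x: ','.join(map(str, x))).saveAsTextFile(".../output_rdd")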
DataFrame solution: You can do this with a left join. I am assuming that Null here is the string 'Null', not an actual None.
# Creating the DataFrames
df = sqlContext.createDataFrame([('Low',6,261.54),('High',44,10123.02),('High',27,244.57),
('High',30,4965.75),('Null',22,394.27)],
['Priority','qty','sales'])
df.show()
+--------+---+--------+
|Priority|qty| sales|
+--------+---+--------+
| Low| 6| 261.54|
| High| 44|10123.02|
| High| 27| 244.57|
| High| 30| 4965.75|
| Null| 22| 394.27|
+--------+---+--------+
df1 = sqlContext.createDataFrame([('Low','A'),('High','B'),('Null','K')],
['Priority','grade'])
df1.show()
+--------+-----+
|Priority|grade|
+--------+-----+
| Low| A|
| High| B|
| Null| K|
+--------+-----+
Apply the left join.
df_joined = df.join(df1,['Priority'],how='left')
df_joined.show()
+--------+---+--------+-----+
|Priority|qty| sales|grade|
+--------+---+--------+-----+
| High| 44|10123.02| B|
| High| 27| 244.57| B|
| High| 30| 4965.75| B|
| Low| 6| 261.54| A|
| Null| 22| 394.27| K|
+--------+---+--------+-----+
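If Priority could actually contain None rather than the string 'Null', a plain equality join would drop those rows, because null == null evaluates to null in Spark SQL. A hedged sketch using the null-safe equality operator eqNullSafe (available in Spark 2.3+):

# Null-safe join: None on both sides counts as a match; drop the duplicate key column
df_joined = df.join(df1, df['Priority'].eqNullSafe(df1['Priority']), 'left').drop(df1['Priority'])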
Answer 1 (score: 0)
It looks like you are simply trying to join file1 with file2, using the priority column as the key. In Spark you can use DataFrames, which are much more convenient than RDDs. It takes just a few lines of code.
# Read both CSVs with their headers; column resolution is case-insensitive by default,
# so joining on 'priority' also matches File1's 'Priority' column. The default join
# type is inner, which is fine here because every priority value appears in file2.
file1 = spark.read.option("header", "true").csv(".../file1")
file2 = spark.read.option("header", "true").csv(".../file2")
output = file1.join(file2, ['priority'])
output.show()
+--------+---+--------+-----+
|Priority|qty| sales|grade|
+--------+---+--------+-----+
| Low| 6| 261.54| A|
| High| 44|10123.02| B|
| High| 27| 244.57| B|
| High| 30| 4965.75| B|
| Null| 22| 394.27| K|
+--------+---+--------+-----+
If you want to write it to disk:
output.write.option("header", "true").csv(".../output")
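Note that Spark writes the CSV output as a directory of part files. If you need a single output file, one common approach (a sketch; the output path is hypothetical) is to coalesce to one partition first, at the cost of write parallelism:

# Collapse to one partition so the output directory contains a single part file
output.coalesce(1).write.option("header", "true").csv(".../output_single")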