df1
uid1 var1
0 John 3
1 Paul 4
2 George 5
df2
uid1 var2
0 John 23
1 Paul 44
2 George 52
df3
uid1 var3
0 John 31
1 Paul 45
2 George 53
df_lst = [df1, df2, df3]
How can I merge/join the 3 dataframes in the list on the common key uid1?
Edit: expected output
df1
uid1 var1 var2 var3
0 John 3 23 31
1 Paul 4 44 45
2 George 5 52 53
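The frames in the question look like pandas DataFrames (the answers below use Spark instead). Assuming pandas, one way to get the expected output is to fold the list with functools.reduce and DataFrame.merge — a minimal sketch with the question's data:

```python
from functools import reduce
import pandas as pd

df1 = pd.DataFrame({'uid1': ['John', 'Paul', 'George'], 'var1': [3, 4, 5]})
df2 = pd.DataFrame({'uid1': ['John', 'Paul', 'George'], 'var2': [23, 44, 52]})
df3 = pd.DataFrame({'uid1': ['John', 'Paul', 'George'], 'var3': [31, 45, 53]})
df_lst = [df1, df2, df3]

# Fold the list: each step merges the accumulated frame with the next one on uid1
merged = reduce(lambda left, right: left.merge(right, on='uid1'), df_lst)
print(merged)
```

This scales to any number of frames in the list, since reduce just applies the pairwise merge repeatedly.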
Answer 0: (score: 4)
You can join a list of dataframes. Below is a simple example:
import spark.implicits._

val df1 = spark.sparkContext.parallelize(Seq(
  (0, "John", 3),
  (1, "Paul", 4),
  (2, "George", 5)
)).toDF("id", "uid1", "var1")

val df2 = spark.sparkContext.parallelize(Seq(
  (0, "John", 23),
  (1, "Paul", 44),
  (2, "George", 52)
)).toDF("id", "uid1", "var2")

val df3 = spark.sparkContext.parallelize(Seq(
  (0, "John", 31),
  (1, "Paul", 45),
  (2, "George", 53)
)).toDF("id", "uid1", "var3")

// Fold the list of dataframes into a single dataframe, joining on the shared key columns
val dfList = List(df1, df2, df3)
dfList.reduce((a, b) => a.join(b, Seq("id", "uid1")))
Output:
+---+------+----+----+----+
| id| uid1|var1|var2|var3|
+---+------+----+----+----+
| 1| Paul| 4| 44| 45|
| 2|George| 5| 52| 53|
| 0| John| 3| 23| 31|
+---+------+----+----+----+
Hope this helps!
Answer 1: (score: 1)
Let me suggest a Python answer:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.types as t

# Restart the Spark context if one is already running
if SparkContext._active_spark_context is not None:
    SparkContext._active_spark_context.stop()
sc = SparkContext()
sqlcontext = SQLContext(sc)

rdd_list = [sc.parallelize([('John', i + 1), ('Paul', i + 2), ('George', i + 3)], 1)
            for i in [100, 200, 300]]
df_list = []
for i, r in enumerate(rdd_list):
    schema = t.StructType().add('uid1', t.StringType()) \
                           .add('var{}'.format(i + 1), t.IntegerType())
    df_list.append(sqlcontext.createDataFrame(r, schema))
    df_list[-1].show()
+------+----+
| uid1|var1|
+------+----+
| John| 101|
| Paul| 102|
|George| 103|
+------+----+
+------+----+
| uid1|var2|
+------+----+
| John| 201|
| Paul| 202|
|George| 203|
+------+----+
+------+----+
| uid1|var3|
+------+----+
| John| 301|
| Paul| 302|
|George| 303|
+------+----+
# Join the frames one at a time, accumulating the result
df_res = df_list[0]
for df_next in df_list[1:]:
    df_res = df_res.join(df_next, on='uid1', how='inner')
df_res.show()
+------+----+----+----+
| uid1|var1|var2|var3|
+------+----+----+----+
| John| 101| 201| 301|
| Paul| 102| 202| 302|
|George| 103| 203| 303|
+------+----+----+----+
Another option:
from functools import reduce  # needed on Python 3, where reduce is no longer a builtin

def join_red(left, right):
    return left.join(right, on='uid1', how='inner')

res = reduce(join_red, df_list)
res.show()
+------+----+----+----+
| uid1|var1|var2|var3|
+------+----+----+----+
| John| 101| 201| 301|
| Paul| 102| 202| 302|
|George| 103| 203| 303|
+------+----+----+----+
Answer 2: (score: 0)
Merge and join are two different things for a dataframe. From what I understand of your question, join is what you want. Joining them as
df1.join(df2, df1.uid1 == df2.uid1).join(df3, df1.uid1 == df3.uid1)
should do the trick, but I would also suggest renaming the uid1 column in the df2 and df3 dataframes to uid2 and uid3, so that this naming conflict does not arise in the future.
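The same key-name ambiguity exists in pandas (an assumption here, since this answer is about Spark joins): merging on two identically named columns via left_on/right_on keeps both copies. A hypothetical sketch of the rename-the-key approach with made-up two-row frames:

```python
import pandas as pd

df1 = pd.DataFrame({'uid1': ['John', 'Paul'], 'var1': [3, 4]})
df2 = pd.DataFrame({'uid1': ['John', 'Paul'], 'var2': [23, 44]})

# Rename the key in df2 so each frame carries a distinctly named key column
df2r = df2.rename(columns={'uid1': 'uid2'})

# Join on the differently named keys, then drop the redundant copy
merged = df1.merge(df2r, left_on='uid1', right_on='uid2').drop(columns='uid2')
print(merged)
```

Whether renaming is worth it is a style choice; joining on a shared column name (`on='uid1'`) also avoids the duplicate in pandas.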