How to join/merge a list of dataframes on a common key in PySpark?

Time: 2017-06-13 08:42:21

Tags: python apache-spark pyspark apache-spark-sql

df1
     uid1  var1
0    John     3
1    Paul     4
2  George     5

df2
     uid1  var2
0    John    23
1    Paul    44
2  George    52

df3
     uid1  var3
0    John    31
1    Paul    45
2  George    53

df_lst = [df1, df2, df3]

How can I merge/join the 3 dataframes in the list on the common key uid1?

Edit: expected output

df1
     uid1  var1  var2  var3
0    John     3    23    31
1    Paul     4    44    45
2  George     5    52    53

3 Answers:

Answer 0 (score: 4)

You can join a list of dataframes. Below is a simple example:

import spark.implicits._

// Sample dataframes sharing the keys id and uid1
val df1 = spark.sparkContext.parallelize(Seq(
  (0, "John", 3),
  (1, "Paul", 4),
  (2, "George", 5)
)).toDF("id", "uid1", "var1")

val df2 = spark.sparkContext.parallelize(Seq(
  (0, "John", 23),
  (1, "Paul", 44),
  (2, "George", 52)
)).toDF("id", "uid1", "var2")

val df3 = spark.sparkContext.parallelize(Seq(
  (0, "John", 31),
  (1, "Paul", 45),
  (2, "George", 53)
)).toDF("id", "uid1", "var3")

// Fold the list into a single dataframe by joining on the shared keys
val dfList = List(df1, df2, df3)

dfList.reduce((a, b) => a.join(b, Seq("id", "uid1")))

Output:

+---+------+----+----+----+
| id|  uid1|var1|var2|var3|
+---+------+----+----+----+
|  1|  Paul|   4|  44|  45|
|  2|George|   5|  52|  53|
|  0|  John|   3|  23|  31|
+---+------+----+----+----+
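
Since the question asks for PySpark, here is a rough Python equivalent of the same reduce-over-joins idea (a minimal sketch, assuming an active SparkSession named spark):

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(0, 'John', 3), (1, 'Paul', 4), (2, 'George', 5)],
                            ['id', 'uid1', 'var1'])
df2 = spark.createDataFrame([(0, 'John', 23), (1, 'Paul', 44), (2, 'George', 52)],
                            ['id', 'uid1', 'var2'])
df3 = spark.createDataFrame([(0, 'John', 31), (1, 'Paul', 45), (2, 'George', 53)],
                            ['id', 'uid1', 'var3'])

# Fold the list into one dataframe by joining on the shared keys
df = reduce(lambda a, b: a.join(b, ['id', 'uid1']), [df1, df2, df3])
df.show()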

Hope this helps!

Answer 1 (score: 1)

Let me suggest a Python answer:

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.types as t

# Stop any previously active context before creating a fresh one
if SparkContext._active_spark_context is not None:
    SparkContext._active_spark_context.stop()
sc = SparkContext()
sqlcontext = SQLContext(sc)

# Three single-partition RDDs of (uid1, varN) pairs
rdd_list = [sc.parallelize([('John', i + 1), ('Paul', i + 2), ('George', i + 3)], 1)
            for i in [100, 200, 300]]

df_list = []
for i, r in enumerate(rdd_list):
    schema = t.StructType().add('uid1', t.StringType()) \
                           .add('var{}'.format(i + 1), t.IntegerType())
    df_list.append(sqlcontext.createDataFrame(r, schema))
    df_list[-1].show()
+------+----+
|  uid1|var1|
+------+----+
|  John| 101|
|  Paul| 102|
|George| 103|
+------+----+

+------+----+
|  uid1|var2|
+------+----+
|  John| 201|
|  Paul| 202|
|George| 203|
+------+----+

+------+----+
|  uid1|var3|
+------+----+
|  John| 301|
|  Paul| 302|
|George| 303|
+------+----+
# Join the dataframes pairwise, left to right, on the shared key
df_res = df_list[0]
for df_next in df_list[1:]:
    df_res = df_res.join(df_next, on='uid1', how='inner')
df_res.show()
+------+----+----+----+
|  uid1|var1|var2|var3|
+------+----+----+----+
|  John| 101| 201| 301|
|  Paul| 102| 202| 302|
|George| 103| 203| 303|
+------+----+----+----+

Another option:

from functools import reduce  # needed on Python 3, where reduce is no longer a builtin

def join_red(left, right):
    return left.join(right, on='uid1', how='inner')

res = reduce(join_red, df_list)
res.show()
+------+----+----+----+
|  uid1|var1|var2|var3|
+------+----+----+----+
|  John| 101| 201| 301|
|  Paul| 102| 202| 302|
|George| 103| 203| 303|
+------+----+----+----+
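
If some uids might be missing from one of the frames, the same reduce pattern works with an outer join instead of the inner join above (a sketch; absent values come back as null):

res_outer = reduce(lambda left, right: left.join(right, on='uid1', how='outer'), df_list)
res_outer.show()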

Answer 2 (score: 0)

Merge and join are two different operations on a dataframe. From what I understand of your question, a join is what you need.

Joining them as

df1.join(df2, df1.uid1 == df2.uid1).join(df3, df1.uid1 == df3.uid1)

should do the trick, but I would also suggest changing the uid1 column names of the df2 and df3 dataframes to uid2 and uid3, so that this conflict does not arise in the future.
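
As a sketch of that renaming (withColumnRenamed is the standard PySpark API for this; df2r/df3r and the uid2/uid3 names follow the suggestion above):

# Rename the key columns so the frames no longer collide on uid1
df2r = df2.withColumnRenamed('uid1', 'uid2')
df3r = df3.withColumnRenamed('uid1', 'uid3')

joined = (df1.join(df2r, df1.uid1 == df2r.uid2)
             .join(df3r, df1.uid1 == df3r.uid3))

# Alternatively, joining on the column name as a string keeps a single uid1 column
joined2 = df1.join(df2, 'uid1').join(df3, 'uid1')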