How to concatenate/append multiple Spark dataframes column-wise in PySpark?

Date: 2017-06-02 04:18:01

Tags: python apache-spark pyspark apache-spark-sql pyspark-sql

How do I do the equivalent of pandas' pd.concat([df1, df2], axis='columns') with PySpark dataframes? I googled around and could not find a good solution.
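For reference, the pandas operation being asked about looks roughly like this (illustrative only; `df1` and `df2` stand for the two sample frames shown below):

    import pandas as pd

    df1 = pd.DataFrame({"var1": [3, 4, 5]})
    df2 = pd.DataFrame({"var2": [23, 44, 52], "var3": [31, 45, 53]})

    # Column-wise concatenation: rows are aligned by index
    combined = pd.concat([df1, df2], axis="columns")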

DF1
var1        
     3      
     4      
     5      

DF2
var2    var3     
  23      31
  44      45
  52      53

Expected output dataframe
var1        var2    var3
     3        23      31
     4        44      45
     5        52      53

Edit: included the expected output.

4 Answers:

Answer 0: (score: 1)

Here is a sample of what you are looking for, but in Scala; I hope you can convert it to PySpark.

This approach zips the underlying RDDs of the two dataframes row by row:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ParquetAppendMode")
  .getOrCreate()
import spark.implicits._

// Two sample dataframes with the same number of rows
val df1 = spark.sparkContext.parallelize(Seq(
  (1, "abc"),
  (2, "def"),
  (3, "hij")
)).toDF("id", "name")

val df2 = spark.sparkContext.parallelize(Seq(
  (19, "x"),
  (29, "y"),
  (39, "z")
)).toDF("age", "address")

// Combined schema: the fields of df1 followed by the fields of df2
val schema = StructType(df1.schema.fields ++ df2.schema.fields)

// Zip the two RDDs row by row and merge each pair into a single Row
val df1df2 = df1.rdd.zip(df2.rdd).map {
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)
}

spark.createDataFrame(df1df2, schema).show()

And here is how to do it using only DataFrames: add a new column row_id to each dataframe with monotonically_increasing_id() and join the two dataframes on the key row_id.

import org.apache.spark.sql.functions._

// Give each dataframe the same synthetic join key
val ddf1 = df1.withColumn("row_id", monotonically_increasing_id())
val ddf2 = df2.withColumn("row_id", monotonically_increasing_id())

// Join on row_id and drop the helper column from the result
val result = ddf1.join(ddf2, Seq("row_id")).drop("row_id")
result.show()

Hope this helps!

Answer 1: (score: 0)

Here is what I did to merge two dataframes column-wise in PySpark (without a join), using @Shankar Koirala's answer:

    +---+-----+        +-----+----+       +---+-----+-----+----+
    | id| name|        |secNo|city|       | id| name|secNo|city|
    +---+-----+        +-----+----+       +---+-----+-----+----+
    |  1|sammy|    +   |  101|  LA|   =>  |  1|sammy|  101|  LA|
    |  2| jill|        |  102|  CA|       |  2| jill|  102|  CA|
    |  3| john|        |  103|  DC|       |  3| john|  103|  DC|
    +---+-----+        +-----+----+       +---+-----+-----+----+

Here is my PySpark code:

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    df1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
    df1 = spark.sparkContext.parallelize([(1, "sammy"), (2, "jill"), (3, "john")])
    df1 = spark.createDataFrame(df1, schema=df1_schema)

    df2_schema = StructType([StructField("secNo", IntegerType()), StructField("city", StringType())])
    df2 = spark.sparkContext.parallelize([(101, "LA"), (102, "CA"), (103, "DC")])
    df2 = spark.createDataFrame(df2, schema=df2_schema)

    # Combined schema: the fields of df1 followed by the fields of df2
    df3_schema = StructType(df1.schema.fields + df2.schema.fields)

    def myFunc(x):
        # x is a pair (row_from_df1, row_from_df2) produced by rdd.zip
        dt1 = x[0]
        dt2 = x[1]

        id = dt1[0]
        name = dt1[1]
        secNo = dt2[0]
        city = dt2[1]

        return [id, name, secNo, city]

    # Zip the two RDDs row by row and flatten each pair into one record
    rdd_merged = df1.rdd.zip(df2.rdd).map(lambda x: myFunc(x))

    df3 = spark.createDataFrame(rdd_merged, schema=df3_schema)

Note that the two tables must have the same number of rows. Thanks, Shankar Koirala.
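As an aside, rdd.zip also assumes the two RDDs have the same number of partitions and the same number of elements in each partition. A quick pre-check might look like the sketch below (it verifies row and partition counts, not per-partition sizes):

    # Sanity checks before calling df1.rdd.zip(df2.rdd)
    assert df1.count() == df2.count(), "row counts differ"
    assert df1.rdd.getNumPartitions() == df2.rdd.getNumPartitions(), (
        "the two RDDs must have the same number of partitions")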

Answer 2: (score: 0)

The PySpark equivalent of the accepted answer would be:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.master("local").getOrCreate()
df1 = spark.sparkContext.parallelize([(1, "a"), (2, "b"), (3, "c")]).toDF(["id", "name"])
df2 = spark.sparkContext.parallelize([(7, "x"), (8, "y"), (9, "z")]).toDF(["age", "address"])

# Combined schema, then zip the two RDDs and concatenate each pair of rows
schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0] + x[1])
spark.createDataFrame(df1df2, schema).show()

Answer 3: (score: 0)

I spent a few hours on this with PySpark, and my working solution is below (incidentally, it is the Python equivalent of @Shankar Koirala's answer):

from pyspark.sql.functions import monotonically_increasing_id

# df2 and df3 are the two dataframes to be concatenated column-wise
DF1 = df2.withColumn("row_id", monotonically_increasing_id())
DF2 = df3.withColumn("row_id", monotonically_increasing_id())

# Join on the synthetic key and drop it from the result
result_df = DF1.join(DF2, "row_id").drop("row_id")

You simply define a common column on both dataframes and drop that column right after the join. I hope this solution helps in cases where the dataframes do not share any common column.

Keep in mind one detail, though: this method may pair the rows of the two dataframes in an arbitrary order, because the generated IDs depend on how the data is partitioned rather than on row position.
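If exact row alignment matters, one possible alternative is to derive the join key from rdd.zipWithIndex(), which assigns consecutive 0-based indices in RDD order. A minimal sketch, assuming df1 and df2 are the two frames from the question, spark is an active SparkSession, and with_row_index is a helper introduced here for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType

    spark = SparkSession.builder.master("local").getOrCreate()

    def with_row_index(df, col_name="row_idx"):
        # Append a consecutive 0-based index column derived from zipWithIndex
        schema = StructType(df.schema.fields + [StructField(col_name, LongType(), False)])
        indexed = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
        return spark.createDataFrame(indexed, schema)

    # Join on the explicit index and drop it afterwards
    result = with_row_index(df1).join(with_row_index(df2), "row_idx").drop("row_idx")
    result.show()

This pairs row i of df1 with row i of df2 regardless of partitioning, at the cost of a round trip through the RDD API.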