How to zip two (or more) DataFrames in Spark

Asked: 2015-10-01 08:08:16

Tags: scala apache-spark dataframe apache-spark-sql

I have two DataFrames a and b.

a is like

Column 1 | Column 2
abc      | 123
cde      | 23

b is like

Column 1
1
2

I want to zip a and b (or even more DataFrames) into something like:

Column 1 | Column 2 | Column 3
abc      | 123      | 1
cde      | 23       | 2

How can I do it?

4 Answers:

Answer 0 (score: 22)

This kind of operation is not supported by the DataFrame API. It is possible to zip two RDDs, but to make it work you have to match both the number of partitions and the number of elements per partition. Assuming this is the case:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}

val a: DataFrame = sc.parallelize(Seq(
  ("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")

// Merge rows
val rows = a.rdd.zip(b.rdd).map{
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}

// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)

// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
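
With the sample data above, ab should end up with the two DataFrames zipped row by row, roughly:

column_1 | column_2 | column_3
abc      | 123      | 1
cde      | 23       | 2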

If the above conditions are not met, the only option that comes to mind is adding an index and joining:

def addIndex(df: DataFrame) = sqlContext.createDataFrame(
  // Add index
  df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
  // Create schema
  StructType(df.schema.fields :+ StructField("_index", LongType, false))
)

// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)

// Join and clean
val ab = aWithIndex
  .join(bWithIndex, Seq("_index"))
  .drop("_index")
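
For reuse, the two steps can be packaged into a small helper. A minimal sketch, assuming the two DataFrames share no column names besides the generated index; zipDataFrames is just an illustrative name, not a Spark API:

// Add an index to each DataFrame, join on it, and drop the helper column.
// Note: the inner join silently drops trailing rows if the lengths differ.
def zipDataFrames(left: DataFrame, right: DataFrame): DataFrame =
  addIndex(left)
    .join(addIndex(right), Seq("_index"))
    .drop("_index")

val zipped = zipDataFrames(a, b)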

Answer 1 (score: 2)

In Scala's implementation of DataFrames, there is no simple way to concatenate two DataFrames into one. We can work around this limitation by adding an index to each row of the DataFrames and then doing an inner join on those indices. This is my stub code for that approach:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.monotonicallyIncreasingId

val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val aWithId: DataFrame = a.withColumn("id", monotonicallyIncreasingId)

val b: DataFrame = sc.parallelize(Seq((1), (2))).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id", monotonicallyIncreasingId)

aWithId.join(bWithId, "id")
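
Worth noting: monotonicallyIncreasingId only guarantees increasing, unique ids, not consecutive ones; each id is derived from the partition id plus a per-partition counter, so the ids of a and b only line up when both DataFrames have the same number of partitions and the same number of rows in each partition. Otherwise the join can misalign or drop rows without any error.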

A little light reading - Check out how Python does this!

Answer 2 (score: 1)

What about pure SQL?

SELECT 
    room_name, 
    sender_nickname, 
    message_id, 
    row_number() over (partition by room_name order by message_id) as message_index, 
    row_number() over (partition by room_name, sender_nickname order by message_id) as user_message_index
from messages
order by room_name, message_id
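
Applied to the original a/b example, the same row_number() idea could look roughly like the sketch below. The temp-table names and the ORDER BY columns are my own assumptions (any deterministic ordering works), and window functions require a HiveContext on Spark 1.x:

// Sketch only: give each table an explicit row number, then join on it
a.registerTempTable("table_a")
b.registerTempTable("table_b")

val ab = sqlContext.sql("""
  SELECT ta.column_1, ta.column_2, tb.column_3
  FROM (SELECT column_1, column_2,
               row_number() OVER (ORDER BY column_1) AS rn
        FROM table_a) ta
  JOIN (SELECT column_3,
               row_number() OVER (ORDER BY column_3) AS rn
        FROM table_b) tb
    ON ta.rn = tb.rn
""")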

Answer 3 (score: 0)

I know the OP was using Scala, but if, like me, you need to know how to do this in pyspark, try the Python code below. Like @zero323's first solution it relies on RDD.zip(), so it will fail unless both DataFrames have the same number of partitions and the same number of rows in each partition.

from pyspark.sql import Row
from pyspark.sql.types import StructType

def zipDataFrames(left, right):
    # Row "class" whose fields are the columns of both DataFrames
    CombinedRow = Row(*left.columns + right.columns)

    def flattenRow(row):
        # row is a (leftRow, rightRow) pair produced by RDD.zip()
        left = row[0]
        right = row[1]
        combinedVals = [left[col] for col in left.__fields__] + [right[col] for col in right.__fields__]
        return CombinedRow(*combinedVals)

    # Zip the two RDDs row by row, flatten each pair and rebuild a
    # DataFrame with the merged schema
    zippedRdd = left.rdd.zip(right.rdd).map(lambda row: flattenRow(row))
    combinedSchema = StructType(left.schema.fields + right.schema.fields)
    return zippedRdd.toDF(combinedSchema)

joined = zipDataFrames(a, b)
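
With the a and b from the question, joined should contain the same three zipped columns (column_1, column_2, column_3) as the Scala versions above; if the partitioning of the two DataFrames does not match, RDD.zip() fails at runtime instead of silently misaligning rows.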