I have two DataFrames, a and b. a is like

Column 1 | Column 2
abc      | 123
cde      | 23

and b is like

Column 1
1
2

I want to zip a and b (or even more) DataFrames so that the result becomes something like

Column 1 | Column 2 | Column 3
abc      | 123      | 1
cde      | 23       | 2

How can I do it?
Answer 0 (score: 22)
This kind of operation is not supported by the DataFrame API. It is possible to zip two RDDs, but to make it work you have to match both the number of partitions and the number of elements per partition. Assuming that is the case:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
val a: DataFrame = sc.parallelize(Seq(
  ("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")

// Merge rows
val rows = a.rdd.zip(b.rdd).map {
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)
}
// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)
// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
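For the two toy DataFrames defined above this is expected to produce a three-column result; a quick check (just a sketch, using only the values already defined):

ab.show()
// +--------+--------+--------+
// |column_1|column_2|column_3|
// +--------+--------+--------+
// |     abc|     123|       1|
// |     cde|      23|       2|
// +--------+--------+--------+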
If the above conditions are not met, the only option that comes to mind is adding an index and joining:
def addIndex(df: DataFrame) = sqlContext.createDataFrame(
  // Add index
  df.rdd.zipWithIndex.map { case (r, i) => Row.fromSeq(r.toSeq :+ i) },
  // Create schema
  StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)
// Join and clean
val ab = aWithIndex
  .join(bWithIndex, Seq("_index"))
  .drop("_index")
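Since the question mentions "or even more" DataFrames, the same index-and-join idea can be folded over a whole sequence. This is only a sketch building on the addIndex helper above; zipAll is a hypothetical name, not part of the original answer:

// Hypothetical helper: zip any number of DataFrames by joining on the _index column
def zipAll(dfs: Seq[DataFrame]): DataFrame =
  dfs.map(addIndex)
    .reduce((left, right) => left.join(right, Seq("_index")))
    .drop("_index")

// e.g. zipAll(Seq(a, b)) should give the same result as ab above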
Answer 1 (score: 2)
In the Scala implementation of DataFrames, there is no simple way to concatenate two DataFrames into one. We can work around this limitation by adding an index to each row of the DataFrames and then doing an inner join on those indices. This is my stub code for that implementation:
import org.apache.spark.sql.functions.monotonicallyIncreasingId

val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val aWithId: DataFrame = a.withColumn("id", monotonicallyIncreasingId)
val b: DataFrame = sc.parallelize(Seq((1), (2))).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id", monotonicallyIncreasingId)
aWithId.join(bWithId, "id")
Answer 2 (score: 1)
What about pure SQL?
SELECT
    room_name,
    sender_nickname,
    message_id,
    row_number() over (partition by room_name order by message_id) as message_index,
    row_number() over (partition by room_name, sender_nickname order by message_id) as user_message_index
FROM messages
ORDER BY room_name, message_id
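The snippet above is written against a different table (messages), but the same row_number() idea can be applied to the original a/b example through the DataFrame window API. A minimal sketch, assuming Spark's window functions are available and that the existing row order is an acceptable zip order:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, monotonically_increasing_id}

// Assign a sequential row number to each DataFrame, then join on it.
// Note: a window without partitionBy moves all rows to a single partition.
val w = Window.orderBy(monotonically_increasing_id())
val aNumbered = a.withColumn("_rn", row_number().over(w))
val bNumbered = b.withColumn("_rn", row_number().over(w))
val zipped = aNumbered.join(bNumbered, Seq("_rn")).drop("_rn")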
Answer 3 (score: 0)
I know the OP was using Scala, but if, like me, you need to know how to do this in pyspark, try the Python code below. Like @zero323's first solution it relies on RDD.zip(), so it will fail if both DataFrames do not have the same number of partitions and the same number of rows in each partition.
from pyspark.sql import Row
from pyspark.sql.types import StructType


def zipDataFrames(left, right):
    CombinedRow = Row(*left.columns + right.columns)

    def flattenRow(row):
        left = row[0]
        right = row[1]
        combinedVals = [left[col] for col in left.__fields__] + [right[col] for col in right.__fields__]
        return CombinedRow(*combinedVals)

    zippedRdd = left.rdd.zip(right.rdd).map(lambda row: flattenRow(row))
    combinedSchema = StructType(left.schema.fields + right.schema.fields)
    return zippedRdd.toDF(combinedSchema)


joined = zipDataFrames(a, b)