Is it possible to groupByKey two Datasets of two different classes, so that the result is
key -> Array([Class1 instance], [Class2 instance], [Class2 instance])?
To clarify the question, here is a simple piece of Scala code.
object DataSetGrouping {

  import org.apache.spark.sql.SparkSession
  import java.sql.Timestamp

  case class Loan(loanId: String, principalAmount: Double)
  case class Payment(loanId: String, paymentAmount: Double, paymentDate: Timestamp)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("DataSetGrouping").getOrCreate()
    import spark.implicits._

    val loanData = Seq(
      Loan("loan1", 30000),
      Loan("loan2", 60000)).toDS()

    val paymentsData = Seq(
      Payment("loan1", 10000, date("2017-07-31")),
      Payment("loan1", 10000, date("2017-08-31")),
      Payment("loan2", 20000, date("2017-07-31")),
      Payment("loan2", 20000, date("2017-08-31"))).toDS()

    // Key both Datasets by loanId
    val paymentMap = paymentsData.map(p => (p.loanId, p))
    val loanMap = loanData.map(l => (l.loanId, l))

    paymentMap.show()
    loanMap.show()
  }

  // Parses a yyyy-MM-dd string into a Timestamp at midnight
  def date(date: String): Timestamp =
    java.sql.Timestamp.valueOf(java.time.LocalDateTime.parse(date + "T00:00:00"))
}
Is it possible to group these two Datasets together so that the result looks like the following?

loan1 -> [Loan("loan1", ...), Payment("loan1", ...), Payment("loan1", ...)],
loan2 -> [Loan("loan2", ...), Payment("loan2", ...), Payment("loan2", ...)]
Answer (score: 1)
Without resorting to a Kryo Encoder and Any, the closest you can get is probably something like this:
paymentsData.groupByKey(_.loanId).mapGroups {
  case (id, xs) => (id, xs.toSeq)
}.toDF("loanID", "payments").join(loanData, Seq("loanID"))
+------+------------------------------------------------------------------------------+---------------+
|loanID|payments |principalAmount|
+------+------------------------------------------------------------------------------+---------------+
|loan1 |[[loan1,10000.0,2017-07-31 00:00:00.0], [loan1,10000.0,2017-08-31 00:00:00.0]]|30000.0 |
|loan2 |[[loan2,20000.0,2017-07-31 00:00:00.0], [loan2,20000.0,2017-08-31 00:00:00.0]]|60000.0 |
+------+------------------------------------------------------------------------------+---------------+
This is, however, quite expensive because of the grouping step (a full shuffle).
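If you want to get back from the joined DataFrame to typed objects, one option is to map it onto a result case class whose field names match the joined columns. This is a hedged sketch, not part of the original answer; LoanWithPayments is an assumed helper name:

// Assumed helper class, not in the question's code; define it at top level
// next to Loan and Payment so that Spark can derive an Encoder for it.
case class LoanWithPayments(loanID: String, payments: Seq[Payment], principalAmount: Double)

val joined = paymentsData.groupByKey(_.loanId).mapGroups {
  case (id, xs) => (id, xs.toSeq)
}.toDF("loanID", "payments").join(loanData, Seq("loanID"))

// The array<struct> column resolves back to Seq[Payment] by field name
// (column resolution is case-insensitive by default).
val typed = joined.as[LoanWithPayments]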
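The mixed-type grouping the question literally asks for (a single collection holding both Loan and Payment values per key) can be expressed, but only by falling back to the Kryo Encoder and Any that the answer advises against. A minimal sketch, assuming it runs inside main with spark.implicits._ in scope and that you accept an opaque binary representation that show() cannot render:

import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// Kryo-serialized encoders for the mixed-type pairs; the column becomes
// opaque binary, so the Dataset can only be consumed as JVM objects.
implicit val pairEnc: Encoder[(String, Any)] = Encoders.kryo[(String, Any)]
implicit val groupEnc: Encoder[(String, Seq[Any])] = Encoders.kryo[(String, Seq[Any])]

// Upcast both Datasets to (key, Any) so they can be unioned together.
val mixed: Dataset[(String, Any)] =
  loanData.map(l => (l.loanId, l: Any))
    .union(paymentsData.map(p => (p.loanId, p: Any)))

val grouped: Dataset[(String, Seq[Any])] =
  mixed.groupByKey(_._1).mapGroups { case (id, xs) => (id, xs.map(_._2).toSeq) }

// collect() deserializes back to objects; order within a group is not
// guaranteed, but each Seq holds one Loan and its Payments, roughly:
// (loan1, List(Loan(loan1,30000.0), Payment(loan1,10000.0,...), ...))
grouped.collect().foreach(println)

This shuffles the fully serialized objects and gives up all columnar optimizations, which is why the answer steers away from it.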