Grouping two Datasets of different classes in Spark

Asked: 2017-08-10 21:03:26

Tags: scala apache-spark dataset

Is it possible to groupByKey two Datasets of different classes, so that the result is

key -> Array([instance of Class1], [instance of Class2], [instance of Class2])

To clarify the question, here is some simple Scala code.

object DataSetGrouping {

  import org.apache.spark.sql.SparkSession
  import java.sql.Timestamp

  case class Loan(loanId: String, principalAmount: Double)
  case class Payment(loanId: String, paymentAmount: Double, paymentDate: Timestamp)

  def main(args: Array[String]) {

    val spark = SparkSession.builder().master("local").appName("DataSetGrouping").getOrCreate()
    import spark.implicits._

    val loanData = Seq(
      Loan("loan1", 30000),
      Loan("loan2", 60000)).toDS()

    val paymentsData = Seq(
      Payment("loan1", 10000, date("2017-07-31")),
      Payment("loan1", 10000, date("2017-08-31")),
      Payment("loan2", 20000, date("2017-07-31")),
      Payment("loan2", 20000, date("2017-08-31"))).toDS()

    // key each Dataset by loanId as (key, value) pairs
    val paymentMap = paymentsData.map(p => (p.loanId, p))
    val loanMap = loanData.map(l => (l.loanId, l))

    paymentMap.show()
    loanMap.show()
  }

  def date(date: String): Timestamp = {
    Timestamp.valueOf(java.time.LocalDateTime.parse(date + "T00:00:00"))
  }

}

Is it possible to group these two Datasets so that the result looks like this?

loan1 -> [Loan("loan1", ...), Payment("loan1", ...), Payment("loan1", ...)],

loan2 -> [Loan("loan2", ...), Payment("loan2", ...), Payment("loan2", ...)]

1 answer:

Answer 0 (score: 1)

Unless you want to deal with Kryo Encoders and Any, the closest thing you can get is probably something like this:

// group payments by loanId, collect each group into a Seq, then join the loans back in
paymentsData.groupByKey(_.loanId).mapGroups { 
  case (id, xs) => (id, xs.toSeq) 
}.toDF("loanID", "payments").join(loanData, Seq("loanID"))

+------+------------------------------------------------------------------------------+---------------+
|loanID|payments                                                                      |principalAmount|
+------+------------------------------------------------------------------------------+---------------+
|loan1 |[[loan1,10000.0,2017-07-31 00:00:00.0], [loan1,10000.0,2017-08-31 00:00:00.0]]|30000.0        |
|loan2 |[[loan2,20000.0,2017-07-31 00:00:00.0], [loan2,20000.0,2017-08-31 00:00:00.0]]|60000.0        |
+------+------------------------------------------------------------------------------+---------------+

Due to the grouping it is quite expensive, though.
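
For completeness, here is a rough sketch (not part of the original answer) of what the Kryo-based route mentioned above might look like. It assumes the Loan/Payment case classes, loanData, paymentsData, and the SparkSession from the question are in scope; the names anyEnc, mixed, and grouped are just illustrative. Both record types are widened to Any and serialized with Kryo:

import org.apache.spark.sql.{Dataset, Encoders}

// Encoder for (loanId, record) pairs where the record payload is Kryo-serialized
val anyEnc = Encoders.tuple(Encoders.STRING, Encoders.kryo[Any])

// Widen both Datasets to (String, Any) and union them into a single Dataset
val mixed: Dataset[(String, Any)] =
  loanData.map(l => (l.loanId, l: Any))(anyEnc) union
    paymentsData.map(p => (p.loanId, p: Any))(anyEnc)

// Group by loanId; each key maps to a mixed Seq of Loan and Payment objects
val grouped: Dataset[(String, Seq[Any])] = mixed
  .groupByKey(_._1)(Encoders.STRING)
  .mapGroups { case (id, xs) => (id, xs.map(_._2).toSeq) }(
    Encoders.tuple(Encoders.STRING, Encoders.kryo[Seq[Any]]))

Note that with Kryo encoders the payload is stored as an opaque binary column, so grouped.show() will not print readable rows and Spark SQL cannot optimize over the record fields, which is why the typed mapGroups + join approach above is usually preferable.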