spark rdd: grouping and filtering

Date: 2019-02-22 01:25:54

Tags: scala apache-spark rdd

I have an RDD of objects, labResults:

case class LabResult(patientID: String, date: Long, labName: String, value: String)

I want to transform this RDD so that it contains only one row for each patientID and labName combination. That row should be the last one for the combination (I am only interested in the latest date on which the patient had this lab). This is how I did it:

//group rows by patient and lab and take only the last one
val cleanLab = labResults.groupBy(x => (x.patientID, x.labName)).map(_._2).map { events =>
  val latest_date = events.maxBy(_.date)
  val lab = events.filter(x=> x.date == latest_date)
  lab.take(1)
}

Finally, I want to create edges from this RDD:

val edgePatientLab: RDD[Edge[EdgeProperty]] = cleanLab
  .map({ lab =>
    Edge(lab.patientID.toLong, lab2VertexId(lab.labName), PatientLabEdgeProperty(lab).asInstanceOf[EdgeProperty])
  })

I get an error:

value patientID is not a member of Iterable[edu.gatech.cse6250.model.LabResult]

[error] Edge(lab.patientID.toLong, lab2VertexId(lab.labName), PatientLabEdgeProperty(lab).asInstanceOf[EdgeProperty])
[error]          ^
[error] /hw4/stu_code/src/main/scala/edu/gatech/cse6250/graphconstruct/GraphLoader.scala:94:53: value labName is not a member of Iterable[edu.gatech.cse6250.model.LabResult]
[error] Edge(lab.patientID.toLong, lab2VertexId(lab.labName), PatientLabEdgeProperty(lab).asInstanceOf[EdgeProperty])
[error]                                                      ^
[error] /hw4/stu_code/src/main/scala/edu/gatech/cse6250/graphconstruct/GraphLoader.scala:94:86: type mismatch;
[error]  found   : Iterable[edu.gatech.cse6250.model.LabResult]
[error]  required: edu.gatech.cse6250.model.LabResult
[error] Edge(lab.patientID.toLong, lab2VertexId(lab.labName), PatientLabEdgeProperty(lab).asInstanceOf[EdgeProperty])

So the problem seems to be that "cleanLab" is not an RDD of LabResult as I expected, but an RDD of Iterable[edu.gatech.cse6250.model.LabResult]: groupBy produces an Iterable of rows per key, and take(1) on that Iterable returns another Iterable rather than a single element.
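
For illustration, take(1) keeps the Iterable type even though it keeps only one element (a quick REPL check):

scala> Iterable(1, 2, 3).take(1)
res0: Iterable[Int] = List(1)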

How can I fix this?

1 Answer:

Answer 0 (score: 0)

Here is my approach for the first part. About Edge and the other classes I can't say anything, since I don't know where they come from (are they from here?)

scala> val ds = List(("1", 1, "A", "value 1"), ("1", 3, "A", "value 3"), ("1", 3, "B", "value 3"), ("1", 2, "A", "value 2"), ("1", 3, "B", "value 3"), ("1", 5, "B", "value 5") ).toDF("patientID", "date", "labName", "value").as[LabResult]
ds: org.apache.spark.sql.Dataset[LabResult] = [patientID: string, date: int ... 2 more fields]

scala> ds.show
+---------+----+-------+-------+
|patientID|date|labName|  value|
+---------+----+-------+-------+
|        1|   1|      A|value 1|
|        1|   3|      A|value 3|
|        1|   3|      B|value 3|
|        1|   2|      A|value 2|
|        1|   3|      B|value 3|
|        1|   5|      B|value 5|
+---------+----+-------+-------+


scala> val grouped = ds.groupBy("patientID", "labName").agg(max("date") as "date")
grouped: org.apache.spark.sql.DataFrame = [patientID: string, labName: string ... 1 more field]

scala> grouped.show
+---------+-------+----+
|patientID|labName|date|
+---------+-------+----+
|        1|      A|   3|
|        1|      B|   5|
+---------+-------+----+


scala> val cleanLab = ds.join(grouped, Seq("patientID", "labName", "date")).as[LabResult]
cleanLab: org.apache.spark.sql.Dataset[LabResult] = [patientID: string, labName: string ... 2 more fields]

scala> cleanLab.show
+---------+-------+----+-------+
|patientID|labName|date|  value|
+---------+-------+----+-------+
|        1|      A|   3|value 3|
|        1|      B|   5|value 5|
+---------+-------+----+-------+


scala> cleanLab.head
res45: LabResult = LabResult(1,3,A,value 3)

scala>
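
For the second part, here is a minimal RDD-only sketch, assuming labResults: RDD[LabResult] as in the question, and that Edge, EdgeProperty, lab2VertexId and PatientLabEdgeProperty are the question's own definitions (Edge presumably being GraphX's org.apache.spark.graphx.Edge). Keying by (patientID, labName) and reducing with reduceByKey leaves exactly one LabResult per combination, so cleanLab keeps the element type instead of becoming an RDD of Iterables:

import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.Edge  // assuming Edge comes from GraphX

// key each result by (patientID, labName), then keep only the latest result per key
val cleanLab: RDD[LabResult] = labResults
  .map(r => ((r.patientID, r.labName), r))
  .reduceByKey((a, b) => if (a.date >= b.date) a else b)
  .values

// cleanLab is now RDD[LabResult], so the original edge mapping type-checks
val edgePatientLab: RDD[Edge[EdgeProperty]] = cleanLab.map { lab =>
  Edge(lab.patientID.toLong, lab2VertexId(lab.labName),
    PatientLabEdgeProperty(lab).asInstanceOf[EdgeProperty])
}

As a side benefit, reduceByKey combines values on each partition before shuffling, so it avoids materializing the full group per key the way groupBy does, which helps on large inputs.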