Using a Java domain object instead of a Scala case class when creating a Spark Dataset

Asked: 2016-10-19 18:47:22

Tags: scala apache-spark

I am trying to create a Spark Dataset from an RDD using the RDD#toDS method.

However, rather than specifying the schema with a Scala case class, I would like to use existing domain objects defined in a third-party library. When I try this, I get the following error:

scala> import org.hl7.fhir.dstu3.model.Patient
import org.hl7.fhir.dstu3.model.Patient

scala> val patients = sc.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://mongodb/fhir.patients")))
patients: com.mongodb.spark.rdd.MongoRDD[org.bson.Document] = MongoRDD[0] at RDD at MongoRDD.scala:47

scala> val patientsDataSet = patients.toDS[Patient]()
<console>:44: error: not enough arguments for method toDS: (beanClass: Class[org.hl7.fhir.dstu3.model.Patient])org.apache.spark.sql.Dataset[org.hl7.fhir.dstu3.model.Patient].
Unspecified value parameter beanClass.
         val patientsDataSet = patients.toDS[Patient]()
                                                     ^

Here is what I get when I drop the parentheses:

scala> val patientsDataSet = patients.toDS[Patient]
<console>:46: error: missing arguments for method toDS in class MongoRDD;
follow this method with `_' if you want to treat it as a partially applied function
         val patientsDataSet = patients.toDS[Patient]

Is there any way I can use a Java object here instead of a case class?

Thanks!

1 answer:

Answer 0 (score: 0):

One option is to create a case class that extends the Java object.

Java:

public class Patient {

  private final String name;
  private final String status;

  public Patient(String name, String status) {
    this.name = name;
    this.status = status;
  }

  public String getName() {
    return name;
  }

  public String getStatus() {
    return status;
  }

}

Scala:

case class Patient0(name: String, status: String) extends Patient(name, status)

val patients = sc.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://mongodb/fhir.patients")))
// Use the case class, which gets a Product encoder, as the Dataset's element type.
val patientsDataSet = patients.toDS[Patient0]()
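Alternatively, the compiler error in the question indicates that MongoRDD's toDS also has an overload that takes a Java bean class directly, so passing the class explicitly may work without an intermediate case class. A minimal sketch, assuming the Patient type follows JavaBean conventions so Spark's bean encoder can handle it:

import com.mongodb.spark._
import com.mongodb.spark.config.ReadConfig
import org.hl7.fhir.dstu3.model.Patient

val patients = sc.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://mongodb/fhir.patients")))
// Supply the bean class explicitly, matching the
// toDS(beanClass: Class[T]): Dataset[T] signature shown in the error message.
val patientsDataSet = patients.toDS(classOf[Patient])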