Spark Dataframe to a Dataset of a Java class

Date: 2017-01-26 15:28:31

Tags: java scala apache-spark

I want to convert a Dataframe that I read in as Json into a Dataset of a given class. So far this works very well whenever I am able to write my own case class:


case class MyCaseClass(...)
val df = spark.read.json("path/to/json")
val ds = df.as[MyCaseClass]

def myFunction(input: MyCaseClass): MyCaseClass = {
    // Do some validation and things
    input
}

ds.map(myFunction)

Now, however, I am bound to external Java classes (specifically, classes generated by thrift). So here is a more concrete example with a custom class:

Json:

{"a":1,"b":"1","wrapper":{"inside":"1.1", "map": {"k": "v"}}}
{"a":2,"b":"2","wrapper":{"inside":"2.1", "map": {"k": "v"}}}
{"a":3,"b":"3","wrapper":{"inside":"3.1", "map": {"k": "v"}}}

Class:

class MyInnerClass(var inside: String, var map: Map[String, String]) extends java.io.Serializable {
  def getInside(): String = {inside}
  def setInside(newInside: String) {inside = newInside}
  def getMap(): Map[String, String] = {map}
  def setMap(newMap: Map[String, String]) {map = newMap}
}

class MyClass(var a: Int, var b: String, var wrapper: MyInnerClass) extends java.io.Serializable {
  def getA(): Int = {a}
  def setA(newA: Int) {a = newA}
  def getB(): String = {b}
  def setB(newB: String) {b = newB}
  def getWrapper(): MyInnerClass = {wrapper}
  def setWrapper(newWrapper: MyInnerClass) {wrapper = newWrapper}
}

So I want to do the following:

val json = spark.read.json("path/to/json")
json.as[MyClass]

However, that throws:

Unable to find encoder for type stored in a Dataset.  Primitive type (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.

So, I found out about custom encoders (here and here) and tried:

import org.apache.spark.sql.Encoders
val kryoMyClassEncoder  = Encoders.kryo[MyClass]
json.as[MyClass](kryoMyClassEncoder)

which raises an error as well. So how can I convert a Dataframe into a Dataset of a custom class?

2 Answers:

Answer 0 (score: 1)

Instead of the kryo encoder, please try using the product encoder, i.e.:

val productMyClassEncoder  = Encoders.product[MyClass]
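Note that `Encoders.product` requires a Scala `Product` type (a case class or tuple), so it will not compile for a plain getter/setter class like `MyClass` above. For bean-style classes, such as those generated by thrift, `Encoders.bean` is the matching built-in. A minimal sketch, assuming the generated class follows JavaBean conventions (a no-arg constructor and `java.util` collection types) and that a local `SparkSession` is acceptable:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("bean-encoder-sketch")
  .getOrCreate()

// Encoders.bean derives a struct-typed encoder from the class's getters and
// setters, so the Dataset keeps named, queryable columns (a kryo-encoded
// Dataset, by contrast, is a single opaque binary column).
implicit val myClassEncoder: Encoder[MyClass] = Encoders.bean(classOf[MyClass])

val ds = spark.read.json("path/to/json").as[MyClass]
```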

Answer 1 (score: 1)

I ran into the same problem with a case class declared inside a method (there, importing spark.implicits._ was to no avail). After moving the class declaration outside of the method, import spark.implicits._ worked.
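That fix can be sketched as follows; `Wrapper`, `Record`, and `JsonToDataset` are illustrative names, not from the original question, and a live `SparkSession` is assumed to be passed in:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Top-level declarations: Spark can derive product encoders for case classes
// only when they are declared outside the method that builds the Dataset.
// Note the Long fields: spark.read.json infers whole numbers as bigint.
case class Wrapper(inside: String, map: Map[String, String])
case class Record(a: Long, b: String, wrapper: Wrapper)

object JsonToDataset {
  def load(spark: SparkSession, path: String): Dataset[Record] = {
    import spark.implicits._ // brings the implicit case-class encoders into scope
    spark.read.json(path).as[Record]
  }
}
```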