Spark custom encoder for DataFrames

Asked: 2017-03-14 13:07:37

Tags: scala apache-spark apache-spark-sql spark-dataframe encoder

I am aware of How to store custom objects in Dataset? However, it is still not clear to me how to build such a custom encoder so that it serializes properly to multiple fields. I manually created some functions that map a Polygon back and forth to a (String, Int) tuple, i.e. a Dataset - RDD - Dataset round trip, by mapping the object to basic types Spark can handle (edit: full code below).

For example, to go from a Polygon object to a (String, Int) tuple, I use the following:

import com.vividsolutions.jts.geom.Polygon
import com.vividsolutions.jts.io.WKTWriter

def writeSerializableWKT(iterator: Iterator[AnyRef]): Iterator[(String, Int)] = {
    val writer = new WKTWriter()
    iterator.flatMap(cur => {
      val cPoly = cur.asInstanceOf[Polygon]
      // TODO is it efficient to create this collection? Is this a proper iterator-to-iterator transformation?
      List((writer.write(cPoly), cPoly.getUserData.asInstanceOf[Int])).iterator
    })
  }
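For reference, WKTGeometryWithPayload used below is just a plain product type that Spark's built-in case-class encoder can handle (a minimal sketch; the field names are taken from the reader code, everything else is assumed):

// minimal sketch: a case class Spark can encode out of the box
case class WKTGeometryWithPayload(lineString: String, payload: Int)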
import com.vividsolutions.jts.io.{ParseException, WKTReader}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset

def createSpatialRDDFromLinestringDataSet(geoDataset: Dataset[WKTGeometryWithPayload]): RDD[Polygon] = {
    geoDataset.rdd.mapPartitions(iterator => {
      val reader = new WKTReader()
      iterator.flatMap(cur => {
        try {
          reader.read(cur.lineString) match {
            case p: Polygon =>
              // the pattern match already yields a Polygon, no extra cast needed
              p.setUserData(cur.payload)
              List(p).iterator
            case _ => throw new NotImplementedError("Multipolygon or others not supported")
          }
        } catch {
          case e: ParseException =>
            logger.error("Could not parse", e)
            Iterator.empty
        }
      })
    })
  }
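Putting both helpers together, the round trip I currently perform looks roughly like this (a sketch; it assumes a SparkSession named spark with spark.implicits._ imported and a geoDataset: Dataset[WKTGeometryWithPayload] in scope, both placeholder names):

import spark.implicits._

// Dataset -> RDD[Polygon]: parse the WKT strings back into geometries
val spatialRDD: RDD[Polygon] = createSpatialRDDFromLinestringDataSet(geoDataset)

// ... do the spatial work on the RDD ...

// RDD[Polygon] -> Dataset: serialize each geometry back to (WKT, payload)
val backAsDataset: Dataset[WKTGeometryWithPayload] =
  spatialRDD.mapPartitions(writeSerializableWKT)
    .map { case (wkt, id) => WKTGeometryWithPayload(wkt, id) }
    .toDS()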

I notice that I am already starting to do a lot of this work twice (see the links to both methods). Now I want to be able to handle

https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/GeoSpark.scala#L122-L154 (full code below)

/myOrg/GeoSpark.scala#L82-L84
val joinResult = JoinQuery.SpatialJoinQuery(objectRDD, minimalPolygonCustom, true)
// joinResult.map()
val joinResultCounted = JoinQuery.SpatialJoinQueryCountByKey(objectRDD, minimalPolygonCustom, true)

which is a PairRDD[Polygon, HashSet[Polygon]] and a PairRDD[Polygon, Int], respectively. How would I need to specify my functions as an Encoder so that I do not have to solve the same problem two more times?
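For illustration only (not necessarily the approach I want): I know a kryo-based encoder would at least let Spark wrap a Polygon in a Dataset without the manual WKT mapping, roughly as sketched below (spark and spatialRDD are placeholder names). But that stores the geometry as a single opaque binary column rather than the multiple readable fields I am after, which is why I am asking how a proper custom encoder should be specified.

import com.vividsolutions.jts.geom.Polygon
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// kryo-backed encoder: serializes the whole Polygon into one binary column,
// i.e. it does NOT give the multi-field layout asked about above
implicit val polygonEncoder: Encoder[Polygon] = Encoders.kryo[Polygon]

val polygonsAsDataset: Dataset[Polygon] = spark.createDataset(spatialRDD)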

0 Answers:

There are no answers yet.