Scala Spark DataFrame: new column from a field of an object column

Asked: 2018-04-23 07:43:05

Tags: scala apache-spark spark-dataframe

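I have a DataFrame containing a Polyline column (from Magellan). I would like to extract some fields of this column into new columns. Here is an example of what I am trying to do: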

spark.read
  .format("magellan")
  .load(My_Path)
  .withColumn("xcoordinates", $"polyline"("xcoordinates")) // does not work
  .drop("polyline")

But then I get this error:

Can't extract value from polyline#1190: need struct type but got polyline;

Here is a sample of the data:

DF:(id,polyline,otherColumns)

ID1, {"xcoordinates":[55.37,55.376],"indices":[0],"empty":false,"ycoordinates":[25.23,25.232],"boundingBox":{"xmin":55.376,"ymin":25.23,"xmax":55.376,"ymax":25.234},"valid":true,"type":3}, ...
ID2, {"xcoordinates":[55.37,55.376],"indices":[0],"empty":false,"ycoordinates":[25.23,25.232],"boundingBox":{"xmin":55.376,"ymin":25.23,"xmax":55.376,"ymax":25.234},"valid":true,"type":3}, ...
ID3, {"xcoordinates":[55.37,55.376],"indices":[0],"empty":false,"ycoordinates":[25.23,25.232],"boundingBox":{"xmin":55.376,"ymin":25.23,"xmax":55.376,"ymax":25.234},"valid":true,"type":3}, ...

An example of the expected output:

DF2:(id,xcoordinates,otherColumns)

ID1, [55.37,55.376], ...
ID2, [55.37,55.376], ...
ID3, [55.37,55.376], ...

Edit: I finally managed to do what I wanted:

import magellan.PolyLine
import org.apache.spark.sql.functions.udf

// Read the field from the deserialized PolyLine object inside a UDF
val xcoordinates = (data: PolyLine) => data.xcoordinates
val getXcoordinatesUDF = udf(xcoordinates)

spark.read
  .format("magellan")
  .load(My_Path)
  .withColumn("xcoordinates", getXcoordinatesUDF($"polyline"))
  .drop("polyline")
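A likely explanation for the original error: `$"polyline"("xcoordinates")` asks Spark SQL to extract a field by name, which it resolves only for struct, array, and map columns. The polyline column is a user-defined type, so the analyzer reports "need struct type but got polyline". A UDF works because it receives the deserialized `PolyLine` object. The sketch below extends the same idea to a second field. It assumes that `ycoordinates` is exposed on `PolyLine` in the same way as `xcoordinates` (it appears in the sample data but is not confirmed above), and the object name, helper name `extractCoordinates`, and the input path argument are made up for illustration.

```scala
import magellan.PolyLine
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object PolylineCoordinates {
  // Each UDF receives the deserialized PolyLine, so its fields can be read directly.
  val getXcoordinatesUDF = udf((p: PolyLine) => p.xcoordinates)
  // Assumption: ycoordinates is accessible like xcoordinates.
  val getYcoordinatesUDF = udf((p: PolyLine) => p.ycoordinates)

  // Hypothetical helper: replaces the polyline column with its coordinate arrays.
  def extractCoordinates(df: DataFrame): DataFrame =
    df.withColumn("xcoordinates", getXcoordinatesUDF(col("polyline")))
      .withColumn("ycoordinates", getYcoordinatesUDF(col("polyline")))
      .drop("polyline")

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val spark = SparkSession.builder().appName("polyline-coordinates").getOrCreate()
    val myPath = args(0) // hypothetical: path to the Magellan-readable input

    val raw = spark.read.format("magellan").load(myPath)
    raw.printSchema() // shows the type Spark reports for the polyline column

    extractCoordinates(raw).show(5, truncate = false)
    spark.stop()
  }
}
```

Using `col("polyline")` instead of `$"polyline"` avoids needing `spark.implicits._` in scope. Both refer to the same column.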

0 answers