How do I subtract two columns that come from different tables?

Asked: 2019-11-14 14:18:44

Tags: scala apache-spark-sql subtraction

I have two tables: one contains an id column and a features column (each features entry is made up of 13849 values), and the other contains a mean value.

+---+--------------------+
| id|            features|
+---+--------------------+
| 10|[5.82797050476074...|
| 20|[2.75361084938049...|
| 30|[-2.2027940750122...|
| 40|[4.20199108123779...|
| 50|[2.69677162170410...|
| 60|[2.65212917327880...|
| 70|[3.83443570137023...|
| 80|[0.45349338650703...|
| 90|[3.12527608871459...|
+---+--------------------+

The second table:

+------------------+
|             value|
+------------------+
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
+------------------+

Code:

case class DataClass(id: Int, features: Double)

val newDataDF = spark.read
  .parquet("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet")
  .toDF()

newDataDF.withColumn("features", (newDataDF("features") - 2.4848911616270923) / 1.8305483113586494)

This gives me the error:

cannot resolve '(features - 2.4848911616270923D)' due to data type mismatch: differing types in '(features - 2.4848911616270923D)' (array and double).

How can I fix this?

1 Answer:

Answer 0 (score: 0)

Try using:

val dfWithCalculatedFeatures = newDataDF.withColumn("features", (col("features")(0) - 2.4848911616270923)/1.8305483113586494)
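Note that this only rescales the first element of the array. If the goal is to normalize all 13849 values in features, a sketch along the following lines should work, assuming features is an array&lt;double&gt; column and you are on Spark 2.4 or later (the mean and standard-deviation constants are the ones from the question):

import org.apache.spark.sql.functions.expr

// Sketch: rescale every element of the array column element-wise.
// Assumes "features" is array<double> and Spark 2.4+, where the
// transform higher-order function is available in SQL expressions.
val mean = 2.4848911616270923
val std  = 1.8305483113586494

val normalizedDF = newDataDF.withColumn(
  "features",
  expr(s"transform(features, x -> (x - $mean) / $std)")
)

On older Spark versions the same element-wise rescaling can be done with a UDF over Seq[Double]; and if features is actually an ML Vector rather than an array, the UDF would need to unpack the vector first.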