I have two tables containing id, features, and mean; features consists of 13849 values.
+---+--------------------+
| id| features|
+---+--------------------+
| 10|[5.82797050476074...|
| 20|[2.75361084938049...|
| 30|[-2.2027940750122...|
| 40|[4.20199108123779...|
| 50|[2.69677162170410...|
| 60|[2.65212917327880...|
| 70|[3.83443570137023...|
| 80|[0.45349338650703...|
| 90|[3.12527608871459...|
+---+--------------------+
The second table:
+------------------+
| value|
+------------------+
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
|2.4848911616270923|
+------------------+
Code:
case class DataClass(id: Int, features: Double)
val newDataDF = spark.read
  .parquet("file:///usr/local/spark/dataset/model/data/user/part-r-00000-7d55ba81-5761-4e36-b488-7e6214df2a68.snappy.parquet")
  .toDF()
newDataDF.withColumn("features", ((newDataDF("features")-2.4848911616270923)/1.8305483113586494))
This gives me the error:

cannot resolve '(features - 2.4848911616270923D)' due to data type mismatch: differing types in '(features - 2.4848911616270923D)' (array and double).

How can I fix this?
Answer (score: 0)

Try using:
val dfWithCalculatedFeatures = newDataDF.withColumn("features", (col("features")(0) - 2.4848911616270923)/1.8305483113586494)
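Note that `col("features")(0)` only normalizes the first element of the array. If the intent is to normalize all 13849 values, Spark 2.4+'s higher-order `transform` function can be applied to the whole array instead. A minimal sketch, assuming the same mean and standard deviation constants from the question apply to every element:

```scala
import org.apache.spark.sql.functions.expr

// Apply (x - mean) / stddev to every element of the array column.
// `transform` is a higher-order SQL function available in Spark 2.4+.
val normalizedDF = newDataDF.withColumn(
  "features",
  expr("transform(features, x -> (x - 2.4848911616270923) / 1.8305483113586494)")
)
```

The result keeps `features` as an array column, with each element standardized, rather than replacing it with a single scalar.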