I am trying to generate a dense Vector from a string. But first, I need to convert the values to doubles. How do I get it in double format?
|-- feature: string (nullable = false)
(screenshot of the sample data: https://i.stack.imgur.com/u1kWz.png)
I tried:
import org.apache.spark.sql.types.DoubleType

val new_col = df.withColumn("feature", df("feature").cast(DoubleType))
However, it results in a column of nulls.
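A likely cause, assuming standard Spark cast semantics: casting a string to DoubleType returns null whenever the string is not a single parseable number, so a whole comma-separated list becomes null. A minimal sketch, assuming a SparkSession in scope as spark:

import org.apache.spark.sql.types.DoubleType
import spark.implicits._

// The cast succeeds only when the entire string parses as one number
Seq("12.0", "-1,-1,12.0").toDF("feature")
  .withColumn("as_double", $"feature".cast(DoubleType))
  .show(false)
// +----------+---------+
// |feature   |as_double|
// +----------+---------+
// |12.0      |12.0     |
// |-1,-1,12.0|null     |
// +----------+---------+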
Answer 0 (score: 0)
One way is to use a UDF:
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.DenseVector
import spark.implicits._ // for toDF and the $ column syntax

val df = Seq(
  "-1,-1,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0",
  "7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,",
  "12.0,10.0,10.0,10.0,12.0,12.0,10.0,10.0,10.0,12.0",
  "-1,-1,-1,-1,-1,-1,-1,5.0,9.0,9.0"
).toDF("feature")

// Split the comma-separated string, parse each piece as a Double,
// and wrap the result in a DenseVector
val stringToVector = udf { (s: String) =>
  new DenseVector(s.split(",").map(_.toDouble))
}

df.withColumn("feature", stringToVector($"feature")).
  show(false)
// +---------------------------------------------------+
// |feature |
// +---------------------------------------------------+
// |[-1.0,-1.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0]|
// |[7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0] |
// |[12.0,10.0,10.0,10.0,12.0,12.0,10.0,10.0,10.0,12.0]|
// |[-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,5.0,9.0,9.0] |
// +---------------------------------------------------+
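Note that org.apache.spark.mllib.linalg is the older RDD-based API. If you are on the DataFrame-based ML API (Spark 2.x+), the same UDF can produce an org.apache.spark.ml.linalg vector instead; a sketch under that assumption:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.udf

// Same idea as above, but builds the newer ml vector type
val stringToMlVector = udf { (s: String) =>
  Vectors.dense(s.split(",").map(_.toDouble))
}

df.withColumn("feature", stringToMlVector($"feature")).show(false)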
Answer 1 (score: 0)
"First, I need to convert the values to doubles. How do I get it in double format?"
You can simply use the built-in split function and cast the result to Array[Double], as shown below:
import org.apache.spark.sql.functions._

// split yields array<string>; the cast converts each element to double
val new_col = df.withColumn("feature", split(df("feature"), ",").cast("array<double>"))
which should give you
root
.....
.....
|-- feature: array (nullable = true)
| |-- element: double (containsNull = true)
.....
.....
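Since the original goal was a dense Vector, the array<double> column can then be wrapped with a small UDF. A sketch reusing the mllib DenseVector from the other answer (the name vec_col is just illustrative):

import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.functions.udf

// An array<double> column arrives in a Scala UDF as Seq[Double]
val arrayToVector = udf { (xs: Seq[Double]) => new DenseVector(xs.toArray) }

val vec_col = new_col.withColumn("feature", arrayToVector($"feature"))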
I hope this answer helps.