I start with a matrix:
0.0 0.4 0.4 0.0
0.1 0.0 0.0 0.7
0.0 0.2 0.0 0.3
0.3 0.0 0.0 0.0
The matrix `matrix` is converted to a flat array, `normal_array`:
`val normal_array = matrix.toArray`
I also have an array of strings:
inputCols : Array[String] = Array(p1, p2, p3, p4)
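I need to convert this matrix into the following DataFrame. (Note: the number of rows and the number of columns in the matrix will both equal the length of inputCols.)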
index p1 p2 p3 p4
p1 0.0 0.4 0.4 0.0
p2 0.1 0.0 0.0 0.7
p3 0.0 0.2 0.0 0.3
p4 0.3 0.0 0.0 0.0
In Python, this is easy to do with the pandas library:
arrayToDataframe = pandas.DataFrame(normal_array,columns = inputCols, index = inputCols)
But how can I do this in Scala?
Answer 0 (score: 2)
You can do the following:
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, udf}

// Convert your data to a Scala Seq/List/Array
val list = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
// Define the array of desired column names
val inputCols: Array[String] = Array("p1", "p2", "p3", "p4")
// Create a DataFrame from the data; Spark assigns its own column names (_1, _2, ...)
val df = sparkSession.createDataFrame(list)
// Get the list of column names from the DataFrame
val dfColumns = df.columns
// Build the expressions that rename the columns
val query = inputCols.zipWithIndex.map { case (name, i) => s"${dfColumns(i)} as $name" }
// Apply the renaming expressions
val newDf = df.selectExpr(query: _*)
// UDF that takes a row index (0, 1, 2, 3) and returns the corresponding name from inputCols
val getIndexUDF = udf((rowNo: Long) => inputCols(rowNo.toInt))
// Add a temporary row_no column, derive the index column from it, then drop row_no.
// Note: monotonically_increasing_id guarantees increasing, unique ids, not consecutive ones;
// they are 0, 1, 2, ... only when all rows sit in a single partition.
val dfWithRow = newDf
  .withColumn("row_no", monotonically_increasing_id())
  .withColumn("index", getIndexUDF(col("row_no")))
  .drop("row_no")
dfWithRow.show()
Sample output:
+---+---+---+---+-----+
| p1| p2| p3| p4|index|
+---+---+---+---+-----+
|0.0|0.4|0.4|0.0|   p1|
|0.1|0.0|0.0|0.7|   p2|
|0.0|0.2|0.0|0.3|   p3|
|0.3|0.0|0.0|0.0|   p4|
+---+---+---+---+-----+
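The approach above numbers the rows with monotonically increasing ids, which Spark only guarantees to be increasing and unique, not consecutive. A minimal sketch of one alternative is to attach the labels on the driver, while the data is still an ordered local collection, and only then build the DataFrame. It also works for any length of `inputCols` instead of a fixed 4-tuple. The names `normalArray`, `rows`, and `schema` are illustrative, and the sketch assumes the flat array is laid out row by row (row-major) and that a local SparkSession is acceptable.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val inputCols = Array("p1", "p2", "p3", "p4")
// Assumption: the matrix values are flattened row by row (row-major order).
val normalArray = Array(
  0.0, 0.4, 0.4, 0.0,
  0.1, 0.0, 0.0, 0.7,
  0.0, 0.2, 0.0, 0.3,
  0.3, 0.0, 0.0, 0.0)

val n = inputCols.length

// Pair each group of n values with its label before the data is distributed,
// so the row order is still well defined.
val rows = inputCols.toList.zip(normalArray.grouped(n).toList).map {
  case (label, values) => Row.fromSeq(label +: values.toSeq)
}

// One string column for the label, then one double column per name in inputCols.
val schema = StructType(
  Seq(StructField("index", StringType, nullable = false)) ++
    inputCols.map(name => StructField(name, DoubleType, nullable = false)))

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.show()
```

Here the `index` column comes first, matching the layout asked for in the question; reorder the fields in `schema` and in each `Row` if it should come last.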
Answer 1 (score: 2)
Here is another way:
val data = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
val cols = Array("p1", "p2", "p3", "p4","index")
Zip the two collections and convert the result to a DataFrame.
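// Requires `import spark.implicits._` (for toDF), with a SparkSession named `spark` in scope.
// zip stops at the shorter collection, so the fifth name ("index") is never used as a row label.
data.zip(cols).map {
  case (row, label) => (row._1, row._2, row._3, row._4, label)
}.toDF(cols: _*)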
Output:
+---+---+---+---+-----+
|p1 |p2 |p3 |p4 |index|
+---+---+---+---+-----+
|0.0|0.4|0.4|0.0|p1   |
|0.1|0.0|0.0|0.7|p2   |
|0.0|0.2|0.0|0.3|p3   |
|0.3|0.0|0.0|0.0|p4   |
+---+---+---+---+-----+
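The answers here type the rows in by hand, whereas the question starts from a flat `normal_array` produced by `matrix.toArray`. If `matrix` is a Spark `org.apache.spark.ml.linalg.Matrix` (an assumption; the question does not say which matrix type it uses), `toArray` returns the values in column-major order, so the array has to be regrouped into rows before zipping with the labels. The sketch below shows that regrouping in plain Scala; `columnMajorToRows` is a hypothetical helper, not a library function.

```scala
// Hypothetical helper: regroup a flat column-major array into rows.
def columnMajorToRows(values: Array[Double], numRows: Int, numCols: Int): Seq[Seq[Double]] = {
  require(values.length == numRows * numCols, "array length must equal numRows * numCols")
  (0 until numRows).map { r =>
    // In column-major layout, element (r, c) lives at index c * numRows + r.
    (0 until numCols).map(c => values(c * numRows + r))
  }
}

val inputCols = Array("p1", "p2", "p3", "p4")

// The question's matrix, flattened column by column.
val columnMajor = Array(
  0.0, 0.1, 0.0, 0.3, // column p1
  0.4, 0.0, 0.2, 0.0, // column p2
  0.4, 0.0, 0.0, 0.0, // column p3
  0.0, 0.7, 0.3, 0.0) // column p4

val n = inputCols.length

// Pair each row label with its row of values.
val labelledRows = inputCols.zip(columnMajorToRows(columnMajor, n, n))

labelledRows.foreach { case (label, row) => println(s"$label ${row.mkString(" ")}") }
```

Each `(label, row)` pair can then be turned into a tuple or a `Row` and passed to `toDF` or `createDataFrame`. If the array is already row-major, `normal_array.grouped(n)` is enough and no index arithmetic is needed.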
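Answer 2 (score: 0)
An updated, shorter version looks like the following. It works for Spark versions > 2.4.5. Each statement is explained in the inline comments.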
import org.apache.spark.sql.{SparkSession, functions}

val spark = SparkSession.builder()
  .master("local[*]")
  .getOrCreate()
import spark.implicits._
val cols = (1 to 4).map(i => s"p$i")
val listDf = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
  .toDF(cols: _*) // Map the data to the new column names
  .withColumn("index", // Label each row "p" + (auto-increasing id + 1); ids start at 0
    // Ids are consecutive only within a single partition
    functions.concat(functions.lit("p"), functions.monotonically_increasing_id() + 1))
listDf.show()