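How to convert a DataFrame to a List (Scala)?

Time: 2019-04-24 11:11:50

Tags: scala list apache-spark dataframe

I want to convert a DataFrame containing Double values into a List so that I can use it for calculations. What would you suggest so that I end up with a list of the correct type (i.e. Double)?

My approach is this: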

var newList = myDataFrame.collect().toList 

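But it returns the type List[org.apache.spark.sql.Row], and I don't know what exactly that is!

Is it possible to skip this step and simply pass my DataFrame to a function and do the calculation from there? (For example, I want to compare the third element of the second column with a specific double. Can that comparison be done directly on my DataFrame?)

In any case, I need to understand how to create a list of the correct type every time!

Edit:

Input DataFrame: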

+---+---+ 
|_c1|_c2|
+---+---+ 
|0  |0  | 
|8  |2  | 
|9  |1  | 
|2  |9  | 
|2  |4  | 
|4  |6  | 
|3  |5  | 
|5  |3  | 
|5  |9  | 
|0  |1  | 
|8  |9  | 
|1  |0  | 
|3  |4  |
|8  |7  | 
|4  |9  | 
|2  |5  | 
|1  |9  | 
|3  |6  |
+---+---+

Result after conversion:

List((0,0), (8,2), (9,1), (2,9), (2,4), (4,6), (3,5), (5,3), (5,9), (0,1), (8,9), (1,0), (3,4), (8,7), (4,9), (2,5), (1,9), (3,6))

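But every element in the list must be of type Double.

3 answers:

Answer 0: (score: 2)

You can cast the columns you need to Double, convert the DataFrame to an RDD, and collect it.

If you have data that cannot be parsed, you can clean it with a udf before converting it to double: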

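import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.DoubleType
import spark.implicits._ // assumes a SparkSession named spark (as in spark-shell)

// Parse a string as Double; unparsable values become NaN
val stringToDouble = udf((data: String) => {
  Try (data.toDouble) match {
    case Success(value) => value
    case Failure(exception) => Double.NaN
  }
})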

 val df = Seq(
   ("0.000","0"),
   ("0.000008","24"),
   ("9.00000","1"),
   ("-2","xyz"),
   ("2adsfas","1.1.1")
 ).toDF("a", "b")
  .withColumn("a", stringToDouble($"a").cast(DoubleType))
  .withColumn("b", stringToDouble($"b").cast(DoubleType))

After this, the output is:

+------+----+
|a     |b   |
+------+----+
|0.0   |0.0 |
|8.0E-6|24.0|
|9.0   |1.0 |
|-2.0  |NaN |
|NaN   |NaN |
+------+----+

To get an Array[(Double, Double)]:

val result = df.rdd.map(row => (row.getDouble(0), row.getDouble(1))).collect()

The result will be an Array[(Double, Double)].
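The question asks for a List and for a comparison against the third element of the second column. A minimal sketch of that last step, in plain Scala: the `result` array here is a small stand-in built from the first rows of the question's input (in practice it would come from the `collect()` call above), and `threshold` is a hypothetical value chosen only for illustration.

```scala
// Stand-in for the collected result; normally produced by df.rdd.map(...).collect()
val result: Array[(Double, Double)] =
  Array((0.0, 0.0), (8.0, 2.0), (9.0, 1.0), (2.0, 9.0))

// Array -> List keeps the element type, so this is a List[(Double, Double)]
val asList: List[(Double, Double)] = result.toList

// Third element (index 2) of the second column (tuple field _2)
val thirdOfSecondColumn: Double = asList(2)._2

// Hypothetical value to compare against
val threshold = 1.5
val isGreater: Boolean = thirdOfSecondColumn > threshold
```

Once the values are plain Doubles in a local collection, ordinary Scala comparisons and arithmetic apply; no Row handling is needed past the `map` step.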

Answer 1: (score: 0)

Convert the DataFrame to a Dataset using a case class, then convert it to a list.

This returns a list of objects of your class. All the fields inside the class (mapped to the columns of your table) are already typed, so you won't need to cast every time.

Run the code below to check it. Sample to check and verify (Scala):

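// Assumes spark-shell, where sc and spark.implicits._ are already available
val wa = Array("one","two","two")
val wr = sc.parallelize(wa,3).map(x=>(x,"x",1))
val wdf = wr.toDF("a","b","c")
case class wc(a:String,b:String,c:Int)
// Convert the DataFrame to a typed Dataset[wc] (this step was missing)
val wds = wdf.as[wc]
val myList= wds.collect.toList
myList.foreach(x=>println(x))
myList.foreach(x=>println(x.a.getClass,x.b.getClass,x.c.getClass))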

Answer 2: (score: -1)

myDataFrame.select("_c1", "_c2").collect().map(each => (each.getAs[Double]("_c1"), each.getAs[Double]("_c2"))).toList
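One caveat with the line above: `getAs[Double]` does not convert values, it only casts whatever is stored in the Row, so it may fail at runtime with a ClassCastException if `_c1` and `_c2` are stored as strings or integers. A minimal sketch of a variant that casts the columns to double first. It assumes a local SparkSession, and the three-row `myDataFrame` with string columns is a stand-in for the question's data, not the asker's actual input.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("df-to-list").getOrCreate()
import spark.implicits._

// Stand-in for myDataFrame: string columns named as in the question
val myDataFrame = Seq(("0", "0"), ("8", "2"), ("9", "1")).toDF("_c1", "_c2")

val pairs: List[(Double, Double)] = myDataFrame
  // Cast each column to double and keep the original column name
  .select(col("_c1").cast("double").as("_c1"), col("_c2").cast("double").as("_c2"))
  .collect()
  .map(row => (row.getAs[Double]("_c1"), row.getAs[Double]("_c2")))
  .toList
```

Note that `cast("double")` yields null for strings that cannot be parsed, so data with such values would need cleaning or filtering before the `map` step.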