How to reference column indexes of a Spark DataFrame in Scala to perform calculations

Asked: 2018-05-09 12:02:01

Tags: scala spark-dataframe

I am new to Scala programming. I have worked extensively in R, but in Scala I am finding it hard to loop over specific columns of a DataFrame to perform calculations on their values.

Let me explain with an example:

After joining two DataFrames I arrive at a final DataFrame, and now I need to perform a calculation like the one below:

[image: formula for the derived columns; from the answer below it is Diff = (col2 - col1) / col2 for each matching column pair]

The calculation above references pairs of columns, so after applying it we get the following Spark DataFrame:

[image: the resulting DataFrame with the added Diff_* columns, as shown in the answer below]

How can I reference column indexes in a for loop to compute new column values in a Spark DataFrame in Scala?

1 Answer:

Answer 0 (score: 1)

Here is a solution:

Input Data:
+---+---+---+---+---+---+---+---+---+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |
+---+---+---+---+---+---+---+---+---+
|24 |74 |74 |21 |66 |65 |100|27 |19 |
+---+---+---+---+---+---+---+---+---+

Zip the columns together, which drops any column without a match:

val oneCols = data.schema.filter(_.name.contains("1")).map(x => x.name).sorted
val twoCols = data.schema.filter(_.name.contains("2")).map(x => x.name).sorted
val cols = oneCols.zip(twoCols) 

//cols: Seq[(String, String)] = List((a1,a2), (b1,b2), (c1,c2), (d1,d2))
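The pairing works because Scala's zip is positional and truncates to the shorter collection, so the unmatched e1 column is silently dropped. A minimal standalone sketch (no Spark required) of that behaviour, using hypothetical literal column-name lists in place of the schema lookup:

```scala
// zip pairs elements by position and drops the unmatched tail,
// which is why e1 does not appear in the result.
object ZipDemo {
  def main(args: Array[String]): Unit = {
    val oneCols = List("a1", "b1", "c1", "d1", "e1").sorted
    val twoCols = List("a2", "b2", "c2", "d2").sorted
    val cols = oneCols.zip(twoCols)
    println(cols) // List((a1,a2), (b1,b2), (c1,c2), (d1,d2))
  }
}
```

Note that this silent truncation also means a typo in a column name would quietly shift or drop pairs, so it is worth checking `oneCols.size` against `twoCols.size` before zipping.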

Add the columns dynamically with foldLeft:

import org.apache.spark.sql.functions._

// Each fold step takes the accumulated DataFrame and one (col1, col2) pair
// and appends a derived column Diff_col1 = (col2 - col1) / col2.
val result = cols.foldLeft(data) { case (df, (c1, c2)) =>
  df.withColumn(s"Diff_$c1", (col(c2) - col(c1)) / col(c2))
}
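The foldLeft pattern itself is plain Scala and is easy to verify without a Spark session. A minimal sketch, using a Map of column name to value as a stand-in for one DataFrame row:

```scala
// Same foldLeft shape as above, without Spark: the accumulator (a Map
// standing in for the DataFrame) gains one derived entry per column pair.
object FoldDemo {
  def main(args: Array[String]): Unit = {
    val row  = Map("a1" -> 24.0, "a2" -> 65.0, "b1" -> 74.0, "b2" -> 100.0)
    val cols = Seq(("a1", "a2"), ("b1", "b2"))
    val result = cols.foldLeft(row) { case (acc, (c1, c2)) =>
      acc + (s"Diff_$c1" -> (acc(c2) - acc(c1)) / acc(c2))
    }
    println(result("Diff_b1")) // 0.26
  }
}
```

The same idea scales to any number of column pairs: foldLeft threads the accumulator through every step, so each withColumn call sees all columns added so far.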

The result looks like this:

result.show(false)  

+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |Diff_a1           |Diff_b1|Diff_c1            |Diff_d1             |
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|24 |74 |74 |21 |66 |65 |100|27 |19 |0.6307692307692307|0.26   |-1.7407407407407407|-0.10526315789473684|
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+