Splitting a string (or list of strings) into a Spark dataframe

Date: 2017-01-27 19:31:00

Tags: scala apache-spark pyspark apache-spark-sql spark-dataframe

Given a dataframe df and a list of column names colStr, is there a way in a Spark DataFrame to extract or reference those columns from the dataframe?

Here is an example -

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}
import sqlContext.implicits._

val in = sc.parallelize(List(0, 1, 2, 3, 4, 5))
val df = in.map(x => (x, x+1, x+2)).toDF("c1", "c2", "c3")

// either a single column name or a string of column names delimited by ','
val keyColumn = "c2"
val keyGroup = keyColumn.split(",").toSeq.map(x => col(x))

val ranker = Window.partitionBy(keyGroup).orderBy($"c2")

val new_df = df.withColumn("rank", rank.over(ranker))

new_df.show()

The above fails with

error: overloaded method value partitionBy with alternatives
(cols:org.apache.spark.sql.Column*)org.apache.spark.sql.expressions.WindowSpec <and>
(colName: String,colNames: String*)org.apache.spark.sql.expressions.WindowSpec
cannot be applied to (Seq[org.apache.spark.sql.Column])

Appreciate the help. Thanks!

1 answer:

Answer 0: (score: 3)

If you are trying to partition the dataframe by the columns in the keyGroup list, note that partitionBy is overloaded to take either Column varargs (Column*) or a first column name plus String varargs, so it cannot accept a Seq[Column] directly. Expand the sequence into varargs by passing keyGroup: _* to the partitionBy function:

val ranker = Window.partitionBy(keyGroup: _*).orderBy($"c2")
val new_df = df.withColumn("rank", rank.over(ranker))

new_df.show
+---+---+---+----+
| c1| c2| c3|rank|
+---+---+---+----+
|  0|  1|  2|   1|
|  5|  6|  7|   1|
|  2|  3|  4|   1|
|  4|  5|  6|   1|
|  3|  4|  5|   1|
|  1|  2|  3|   1|
+---+---+---+----+
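
For reference, the error message also lists a (colName: String, colNames: String*) overload, so the window can be built from the raw names instead of Column objects, and the same : _* expansion answers the original question of referencing an arbitrary list of columns. A minimal sketch (the names, ranker2, colStr, and subset values are illustrative and assume keyColumn is non-empty):

// String overload: first name, remaining names as varargs
val names = keyColumn.split(",").map(_.trim)
val ranker2 = Window.partitionBy(names.head, names.tail: _*).orderBy($"c2")

// Selecting an arbitrary list of columns by name
val colStr = Seq("c1", "c2")               // illustrative column-name list
val subset = df.select(colStr.map(col): _*)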