Spark Scala - Need to iterate over a column in a dataframe

Posted: 2018-06-20 15:06:00

Tags: scala apache-spark dataframe

I have the following dataframe:

+---+----------------+
|id |job_title       |
+---+----------------+
|1  |ceo             |
|2  |product manager |
|3  |surfer          |
+---+----------------+

I want to take a column from the dataframe and derive another column, "rank", from it:

+---+----------------+-------+
|id |job_title       | rank  |
+---+----------------+-------+
|1  |ceo             |c-level|
|2  |product manager |manager|
|3  |surfer          |other  |
+---+----------------+-------+

--- UPDATE ---

What I'm trying to do now is:

def func(col: Column): Column = {
  val cLevel = List("ceo", "cfo")
  val managerLevel = List("manager", "team leader")

  when(col.contains(cLevel), "C-level")
    .otherwise(when(col.contains(managerLevel), "manager").otherwise("other"))
}

Currently I'm getting this error:

type mismatch;
found   : Boolean
required: org.apache.spark.sql.Column

I assume there are other problems in the code as well. Sorry, I'm new to Scala, not to mention Spark.

2 Answers:

Answer 0 (score: 2)

You can use the when/otherwise built-in functions in this case:

import org.apache.spark.sql.functions._

def func = when(col("job_title").contains("chief") || col("job_title").contains("ceo"), "c-level")
  .otherwise(when(col("job_title").contains("manager"), "manager")
    .otherwise("other"))

and call the function with withColumn:

df.withColumn("rank", func).show(false)

which should give you the desired output.

I hope the answer is helpful.

Updated

I see that you have updated the post with your own attempt: you created lists of the levels and want to validate the column against those lists. In that case you have to write a udf function, as sketched below.
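A minimal sketch of such a udf, reusing the cLevel and managerLevel lists from the question (the names rankUdf and title are illustrative, not from the original answer):

import org.apache.spark.sql.functions.{col, udf}

val cLevel = List("ceo", "cfo")
val managerLevel = List("manager", "team leader")

// a udf receives plain Scala values, so List.exists and String.contains
// can be used here, unlike on a Column
val rankUdf = udf((title: String) =>
  if (cLevel.exists(c => title.contains(c))) "c-level"
  else if (managerLevel.exists(m => title.contains(m))) "manager"
  else "other"
)

df.withColumn("rank", rankUdf(col("job_title"))).show(false)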

+---+---------------+-------+
|id |job_title      |rank   |
+---+---------------+-------+
|1  |ceo            |c-level|
|2  |product manager|manager|
|3  |surfer         |other  |
+---+---------------+-------+

which is the desired output.

Answer 1 (score: 0)

// build the example dataframe
val df = sc.parallelize(Seq(
  (1, "ceo"),
  (2, "product manager"),
  (3, "surfer"),
  (4, "Vaquar khan")
)).toDF("id", "job_title")

df.show()

// register the dataframe as a temp view so it can be queried with SQL
df.createOrReplaceTempView("user_details")

// option 1: a numeric rank via the RANK() window function, ordered by id
sqlContext.sql("SELECT job_title, RANK() OVER (ORDER BY id) AS rank FROM user_details").show

// option 2: join against a lookup dataframe mapping job_title to a rank label
val df1 = sc.parallelize(Seq(
  ("ceo", "c-level"),
  ("product manager", "manager"),
  ("surfer", "other"),
  ("Vaquar khan", "Problem solver")
)).toDF("job_title", "ranks")

df1.show()
df1.createOrReplaceTempView("user_rank")

sqlContext.sql("SELECT user_details.id, user_details.job_title, user_rank.ranks FROM user_rank JOIN user_details ON user_rank.job_title = user_details.job_title ORDER BY user_details.id").show

Result:

+---+---------------+
| id|      job_title|
+---+---------------+
|  1|            ceo|
|  2|product manager|
|  3|         surfer|
|  4|    Vaquar khan|
+---+---------------+

+---------------+----+
|      job_title|rank|
+---------------+----+
|            ceo|   1|
|product manager|   2|
|         surfer|   3|
|    Vaquar khan|   4|
+---------------+----+

+---------------+--------------+
|      job_title|         ranks|
+---------------+--------------+
|            ceo|       c-level|
|product manager|       manager|
|         surfer|         other|
|    Vaquar khan|Problem solver|
+---------------+--------------+

+---+---------------+--------------+
| id|      job_title|         ranks|
+---+---------------+--------------+
|  1|            ceo|       c-level|
|  2|product manager|       manager|
|  3|         surfer|         other|
|  4|    Vaquar khan|Problem solver|
+---+---------------+--------------+

df: org.apache.spark.sql.DataFrame = [id: int, job_title: string]
df1: org.apache.spark.sql.DataFrame = [job_title: string, ranks: string]

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
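The linked post introduces window functions in Spark SQL. As a sketch, the RANK() query above has a DataFrame API equivalent along these lines:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// note: a window with no partitionBy pulls all rows into one partition,
// which Spark will warn about on larger data
val w = Window.orderBy("id")
df.withColumn("rank", rank().over(w)).show()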