Apache Spark: iterate over DataFrame columns and apply value transformations

Date: 2018-11-08 12:01:07

Tags: scala apache-spark apache-spark-sql

I read a CSV file into a Spark DataFrame, with the column names taken from the CSV file header:

// .csv(...) already selects the built-in CSV source, so no separate .format(...) call is needed
val df = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("users.csv")

Now I need to transform column values, for example:

import org.apache.spark.sql.functions.{col, when}

val modifiedDf1 = df.withColumn("country", when(col("country") === "Italy", "[ITALY]").otherwise(col("country")))

val modifiedDf2 = modifiedDf1.withColumn("city", when(col("city") === "Milan", "[MILAN]").otherwise(col("city")))

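As you can see, to modify column values I have to select each column explicitly, e.g. withColumn("city", ...), and then apply the condition.

That means repeating this code for every column I want to modify.

Is it possible to rewrite this code so that it iterates over every column of the df DataFrame and applies something like the following (in pseudocode):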

df.foreachColumn {
    if (col_name == 'country')
        then when(col_value === "Italy", "[ITALY]").otherwise(col_value)
    else if (col_name == 'city')
        then when(col_value === "Milan", "[MILAN]").otherwise(col_value)
}

I would appreciate an example in Scala.

Update

Here is my original df:

+------+------------------+--------------+-------------+
|name  |email             |phone         |country      |
+------+------------------+--------------+-------------+
|Mike  |mike@example.com  |+91-9999999999|Italy        |
|Alex  |alex@example.com  |+91-9999999998|France       |
|John  |john@example.com  |+1-1111111111 |United States|
|Donald|donald@example.com|+1-2222222222 |United States|
+------+------------------+--------------+-------------+

I now have the following code:

val columnsModify = df.columns.map(col).map(column => {
  val columnName = s"${column}"
  if (columnName == "country") {
    column as "[COUNTRY]"
  } else if (columnName == "email") {
    column as "(EMAIL)"
  } else {
    column as columnName
  }
})

It iterates over the DataFrame columns and renames them according to the specified conditions.

Here is the output:

+------+------------------+--------------+-------------+
|name  |(EMAIL)           |phone         |[COUNTRY]    |
+------+------------------+--------------+-------------+
|Mike  |mike@example.com  |+91-9999999999|Italy        |
|Alex  |alex@example.com  |+91-9999999998|France       |
|John  |john@example.com  |+1-1111111111 |United States|
|Donald|donald@example.com|+1-2222222222 |United States|
+------+------------------+--------------+-------------+

I also need to add transformation logic for the column values, like this (see the commented-out line below):

val columnsModify = df.columns.map(col).map(column => {
  val columnName = s"${column}"
  if (columnName == "country") {
    //when(column_value === "Italy", "[ITALY]").otherwise(column_value)
    column as "[COUNTRY]"
  } else if (columnName == "email") {
    column as "(EMAL)"
  } else {
    column as columnName
  }
})

The expected output of this script should be:

+------+------------------+--------------+-------------+
|name  |(EMAL)            |phone         |[COUNTRY]    |
+------+------------------+--------------+-------------+
|Mike  |mike@example.com  |+91-9999999999|[ITALY]      |
|Alex  |alex@example.com  |+91-9999999998|France       |
|John  |john@example.com  |+1-1111111111 |United States|
|Donald|donald@example.com|+1-2222222222 |United States|
+------+------------------+--------------+-------------+

Please explain how to achieve this.

2 answers:

Answer 0 (score: 1)

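import org.apache.spark.sql.functions.{col, when}

// Build one output Column per field in the schema
val newCols = df.schema.map { column =>
  val colName = column.name
  colName match {
    // Transform the value and rename the column in a single expression
    case "country" => when(col(colName) === "Italy", "ITALY").otherwise(col(colName)).as("[COUNTRY]")
    // Rename only
    case "email" => col(colName).as("[EMAIL]")
    // Leave every other column unchanged
    case _ => col(colName)
  }
}

// select(cols: Column*) takes the whole sequence as varargs
df.select(newCols: _*)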
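For anyone who wants to try this approach end to end, here is a minimal self-contained sketch of the same pattern-matching idea. It is not the answerer's exact code: the local SparkSession settings, the object name TransformColumns, and the in-memory sample rows (standing in for users.csv) are assumptions made for illustration, and the literals are set to the "[ITALY]" and "(EMAL)" values that the question's expected output shows.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object TransformColumns {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("transform-columns")
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-in for the users.csv data shown in the question
    val df = Seq(
      ("Mike", "mike@example.com", "+91-9999999999", "Italy"),
      ("Alex", "alex@example.com", "+91-9999999998", "France")
    ).toDF("name", "email", "phone", "country")

    // Map each column name to a Column expression: transform, rename, or pass through
    val newCols = df.columns.map {
      case "country" =>
        when(col("country") === "Italy", "[ITALY]").otherwise(col("country")).as("[COUNTRY]")
      case "email" => col("email").as("(EMAL)")
      case other   => col(other)
    }

    // A single select applies every expression in one projection
    df.select(newCols: _*).show(false)

    spark.stop()
  }
}
```

Building all expressions first and passing them to one select keeps the column order of the original DataFrame and avoids chaining a separate withColumn call per column.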

Answer 1 (score: 0)

How about using df.selectExpr?

scala> :paste
// Entering paste mode (ctrl-D to finish)

 val sel2 = df.columns.map( x =>
 if(x=="country") "CASE WHEN country = 'Italy' THEN '[ITALY]' ELSE country  end as `[country]` "
 else if(x=="email") " email as `(EMAL)`"
 else x
 )

// Exiting paste mode, now interpreting.

sel2: Array[String] = Array(name, " email as `(EMAL)`", phone, "CASE WHEN country = 'Italy' THEN '[ITALY]' ELSE country  end as `[country]` ")

scala>  df.selectExpr(sel2:_*).show
+------+------------------+--------------+-------------+
|  name|            (EMAL)|         phone|    [country]|
+------+------------------+--------------+-------------+
|  Mike|  mike@example.com|+91-9999999999|      [ITALY]|
|  Alex|  alex@example.com|+91-9999999998|       France|
|  John|  john@example.com| +1-1111111111|United States|
|Donald|donald@example.com| +1-2222222222|United States|
+------+------------------+--------------+-------------+


scala>
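Both answers hard-code the rules inside the mapping function. If the list of columns to change grows, one possible variation is to keep the rules in a lookup table and apply them generically. The sketch below is only an illustration under that assumption: the names ColumnRules, Rule, rules, and applyRules are hypothetical and do not come from Spark or from the answers above.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, when}

object ColumnRules {
  // Hypothetical rule: the output column name plus a function applied to the column's values
  final case class Rule(newName: String, transform: Column => Column)

  // Rules keyed by source column name; the entries mirror the question's example
  val rules: Map[String, Rule] = Map(
    "country" -> Rule("[COUNTRY]", c => when(c === "Italy", "[ITALY]").otherwise(c)),
    "email"   -> Rule("(EMAL)", c => c)
  )

  // Apply the matching rule to each column; columns without a rule pass through unchanged
  def applyRules(df: DataFrame, rules: Map[String, Rule]): DataFrame = {
    val projected = df.columns.map { name =>
      rules.get(name) match {
        case Some(rule) => rule.transform(col(name)).as(rule.newName)
        case None       => col(name)
      }
    }
    df.select(projected: _*)
  }
}
```

With the df from the question, a call such as ColumnRules.applyRules(df, ColumnRules.rules).show(false) should print the renamed and transformed columns. Adding a rule for another column then only means adding one entry to the map.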