I read a CSV file into a Spark DataFrame, inferring the column names from the CSV header:
val df = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("users.csv") // .csv() already selects the CSV source, so no separate .format(...) call is needed
Now I need to transform some of the column values, for example:
val modifiedDf1 = df.withColumn("country", when(col("country") === "Italy", "[ITALY]").otherwise(col("country")))
val modifiedDf2 = modifiedDf1.withColumn("city", when(col("city") === "Milan", "[MILAN]").otherwise(col("city")))
As you can see, to modify a value I have to select each column explicitly with withColumn("city", ...) and then apply the condition, and I have to repeat this code for every column I want to modify.
Is it possible to rewrite this so that it iterates over every column of the df DataFrame and applies the following (in pseudocode):
df.foreachColumn {
  if (col_name == "country")
    then when(col_value === "Italy", "[ITALY]").otherwise(col_value)
  else if (col_name == "city")
    then when(col_value === "Milan", "[MILAN]").otherwise(col_value)
}
I would appreciate an example in Scala.
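The pseudocode above can be sketched as a fold over a rule set (a sketch, assuming Spark 2.x; the rules map, its contents, and the applyRules helper are my own construction, not part of the question):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

// Hypothetical rule table: column name -> (value to match, replacement).
// The "Italy"/"Milan" pairs come from the question's pseudocode.
val rules: Map[String, (String, String)] = Map(
  "country" -> ("Italy" -> "[ITALY]"),
  "city"    -> ("Milan" -> "[MILAN]")
)

// Fold withColumn over the rules, replacing matching values in place.
def applyRules(df: DataFrame): DataFrame =
  rules.foldLeft(df) { case (acc, (name, (from, to))) =>
    if (acc.columns.contains(name))
      acc.withColumn(name, when(col(name) === from, to).otherwise(col(name)))
    else acc // skip rules for columns the DataFrame does not have
  }
```

Adding another transformation is then just another entry in the map.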
UPDATE
Here is my original df:
+------+------------------+--------------+-------------+
|name |email |phone |country |
+------+------------------+--------------+-------------+
|Mike |mike@example.com |+91-9999999999|Italy |
|Alex |alex@example.com |+91-9999999998|France |
|John |john@example.com |+1-1111111111 |United States|
|Donald|donald@example.com|+1-2222222222 |United States|
+------+------------------+--------------+-------------+
I now have the following code:
val columnsModify = df.columns.map { columnName => // iterate over the column names directly
  if (columnName == "country") {
    col(columnName) as "[COUNTRY]"
  } else if (columnName == "email") {
    col(columnName) as "(EMAIL)"
  } else {
    col(columnName)
  }
}
It iterates over the DataFrame columns and renames them according to the specified conditions.
Here is the output:
+------+------------------+--------------+-------------+
|name |(EMAIL) |phone |[COUNTRY] |
+------+------------------+--------------+-------------+
|Mike |mike@example.com |+91-9999999999|Italy |
|Alex |alex@example.com |+91-9999999998|France |
|John |john@example.com |+1-1111111111 |United States|
|Donald|donald@example.com|+1-2222222222 |United States|
+------+------------------+--------------+-------------+
I also need to add transformation logic for the column values, as sketched in the commented line below:
val columnsModify = df.columns.map { columnName =>
  if (columnName == "country") {
    // when(column_value === "Italy", "[ITALY]").otherwise(column_value)
    col(columnName) as "[COUNTRY]"
  } else if (columnName == "email") {
    col(columnName) as "(EMAL)"
  } else {
    col(columnName)
  }
}
The expected output of this script is:
+------+------------------+--------------+-------------+
|name |(EMAL) |phone |[COUNTRY] |
+------+------------------+--------------+-------------+
|Mike |mike@example.com |+91-9999999999|[ITALY] |
|Alex |alex@example.com |+91-9999999998|France |
|John |john@example.com |+1-1111111111 |United States|
|Donald|donald@example.com|+1-2222222222 |United States|
+------+------------------+--------------+-------------+
Please show how this can be achieved.
Answer 0 (score: 1):
val newCols = df.schema.map { column =>
  val colName = column.name
  colName match {
    case "country" => when(col(colName) === "Italy", "[ITALY]").otherwise(col(colName)).as("[COUNTRY]") // "[ITALY]" to match the expected output
    case "email"   => col(colName).as("[EMAIL]")
    case _         => col(colName)
  }
}
df.select(newCols.head, newCols.tail: _*)
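The match above can also be driven by a lookup table; this is a sketch of that variation (the overrides map and the selectWithOverrides helper are my own names, not from the answer):

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, when}

// Hypothetical generalization: per-column select expressions live in a Map,
// and any column without an entry passes through unchanged.
val overrides: Map[String, Column] = Map(
  "country" -> when(col("country") === "Italy", "[ITALY]").otherwise(col("country")).as("[COUNTRY]"),
  "email"   -> col("email").as("[EMAIL]")
)

// Build the select list by looking each column up in the overrides map.
def selectWithOverrides(df: DataFrame, overrides: Map[String, Column]): DataFrame =
  df.select(df.columns.toSeq.map(name => overrides.getOrElse(name, col(name))): _*)
```

Adding another rule is then a one-line change to the map rather than a new case clause.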
Answer 1 (score: 0):
What about using df.selectExpr:
scala> :paste
// Entering paste mode (ctrl-D to finish)
val sel2 = df.columns.map( x =>
  if(x=="country") "CASE WHEN country = 'Italy' THEN '[ITALY]' ELSE country end as `[country]` "
  else if(x=="email") " email as `(EMAL)`"
  else x
)
// Exiting paste mode, now interpreting.
sel2: Array[String] = Array(name, " email as `(EMAL)`", phone, "CASE WHEN country = 'Italy' THEN '[ITALY]' ELSE country end as `[country]` ")
scala> df.selectExpr(sel2:_*).show
+------+------------------+--------------+-------------+
| name| (EMAL)| phone| [country]|
+------+------------------+--------------+-------------+
| Mike| mike@example.com|+91-9999999999| [ITALY]|
| Alex| alex@example.com|+91-9999999998| France|
| John| john@example.com| +1-1111111111|United States|
|Donald|donald@example.com| +1-2222222222|United States|
+------+------------------+--------------+-------------+