Trim leading zeros from a DataFrame in Scala

Date: 2018-01-02 06:20:59

Tags: scala apache-spark

I have a DataFrame:

| subcategory | subcategory_label | category |
| ----------- | ----------------- | -------- |
| 00EEE       | 00EEE FFF         | Drink    |
| 0000EEE     | 00EEE FFF         | Fruit    |
| 0EEE        | 000EEE FFF        | Meat     |

I need to remove the leading zeros from the columns of the DataFrame and get a result like this:

| subcategory | subcategory_label | category |
| ----------- | ----------------- | -------- |
| EEE         | EEE FFF           | Drink    |
| EEE         | EEE FFF           | Fruit    |
| EEE         | EEE FFF           | Meat     |

So far, I can remove the leading zeros from a single column with:

df.withColumn("subcategory", regexp_replace(df("subcategory"), "^0*", "")).show

How can I remove the leading zeros from all the columns of the DataFrame at once?

2 Answers:

Answer 0 (score: 3)

Take this as the given DataFrame:

+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|0000FFFF   |0000EE 000FF     |ABC     |
+-----------+-----------------+--------+

You can create a regexp_replace expression for all of the columns. Something like:

import org.apache.spark.sql.functions.{col, regexp_replace}

val regex_all = df.columns.map( c => regexp_replace(col(c), "^0*", "" ).as(c) )

Then, use select, since it takes arguments of type Column:

df.select(regex_all :_* ).show(false)
+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|FFFF       |EE 000FF         |ABC     |
+-----------+-----------------+--------+

Edit:

Defining a function that returns a sequence of regexp_replaced columns is straightforward:

/**
  * @param origCols total cols in the DF, pass `df.columns`
  * @param replacedCols `Seq` of columns for which expression is to be generated
  * @return `Seq[org.apache.spark.sql.Column]` Spark SQL expression
  */
def createRegexReplaceZeroes(origCols : Seq[String], replacedCols: Seq[String] ) = {
    origCols.map{ c => 
        if(replacedCols.contains(c)) regexp_replace(col(c), "^0*", "" ).as(c) 
        else col(c)
    }
}

This function returns a Seq[org.apache.spark.sql.Column].

Now, store the columns to be replaced in an array:

val removeZeroes = Array( "subcategory", "subcategory_label" )

Then, call the function with removeZeroes as an argument. This returns regexp_replace expressions for the columns listed in removeZeroes, and plain col references for the rest:

df.select( createRegexReplaceZeroes(df.columns, removeZeroes) :_* )
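
For reference, a minimal end-to-end sketch of this approach, assuming a SparkSession named spark and using the sample data from the question:

import org.apache.spark.sql.functions.{col, regexp_replace}
import spark.implicits._

val df = Seq(
  ("00EEE",   "00EEE FFF",  "Drink"),
  ("0000EEE", "00EEE FFF",  "Fruit"),
  ("0EEE",    "000EEE FFF", "Meat")
).toDF("subcategory", "subcategory_label", "category")

val removeZeroes = Array( "subcategory", "subcategory_label" )

// subcategory and subcategory_label get the regexp_replace treatment;
// category is passed through unchanged.
df.select( createRegexReplaceZeroes(df.columns, removeZeroes) :_* ).show(false)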

Answer 1 (score: 0)

You can do the same thing using a UDF. I think it looks more elegant.

scala> val removeLeadingZerosUDF = udf({ x: String => x.replaceAll("^0*", "") })
removeLeadingZerosUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> val df = Seq( "000012340023", "000123400023", "001234000230", "012340002300", "123400002300" ).toDF("cols")
df: org.apache.spark.sql.DataFrame = [cols: string]

scala> df.show()
+------------+
|        cols|
+------------+
|000012340023|
|000123400023|
|001234000230|
|012340002300|
|123400002300|
+------------+

scala> df.withColumn("newCols", removeLeadingZerosUDF($"cols")).show()
+------------+------------+
|        cols|     newCols|
+------------+------------+
|000012340023|    12340023|
|000123400023|   123400023|
|001234000230|  1234000230|
|012340002300| 12340002300|
|123400002300|123400002300|
+------------+------------+
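
Tying this back to the original question, the UDF can also be applied to every column at once with a single select, in the same style as Answer 0. A minimal sketch, assuming the removeLeadingZerosUDF and df defined above:

import org.apache.spark.sql.functions.col

// Apply the UDF to each column, keeping the original column names.
val trimmedAll = df.select( df.columns.map( c => removeLeadingZerosUDF(col(c)).as(c) ) :_* )
trimmedAll.show()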