I have a DataFrame:
| subcategory | subcategory_label | category |
|-------------|-------------------|----------|
| 00EEE       | 00EEE FFF         | Drink    |
| 0000EEE     | 00EEE FFF         | Fruit    |
| 0EEE        | 000EEE FFF        | Meat     |
I need to remove the leading zeros from the columns of the DataFrame and get a result like this:
| subcategory | subcategory_label | category |
|-------------|-------------------|----------|
| EEE         | EEE FFF           | Drink    |
| EEE         | EEE FFF           | Fruit    |
| EEE         | EEE FFF           | Meat     |
So far I can remove the leading zeros from a single column using
df.withColumn("subcategory", regexp_replace(df("subcategory"), "^0*", "")).show
How can I remove the leading zeros from all columns of the DataFrame at once?
Answer 0 (score: 3)
Given this as the input DataFrame:
+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|0000FFFF |0000EE 000FF |ABC |
+-----------+-----------------+--------+
You can create a regexp_replace expression for all the columns. Something like:
val regex_all = df.columns.map( c => regexp_replace(col(c), "^0*", "" ).as(c) )
Then use select, since it takes arguments of type Column:
df.select(regex_all :_* ).show(false)
+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|FFFF |EE 000FF |ABC |
+-----------+-----------------+--------+
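If you prefer not to build the projection by hand, an equivalent alternative (a minimal sketch, assuming df is the DataFrame above and org.apache.spark.sql.functions is in scope) is to fold over df.columns with withColumn:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Fold over every column name, overwriting each column with a version
// whose leading zeros have been stripped by the same "^0*" regex.
val cleaned = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, regexp_replace(col(c), "^0*", ""))
}
cleaned.show(false)

Both approaches produce the same result; the select-based version builds a single projection, while the fold adds one withColumn step per column.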
Edit:
Defining a function that returns the sequence of regexp_replace expressions is straightforward:
/**
 * @param origCols     all columns in the DF; pass `df.columns`
 * @param replacedCols `Seq` of columns for which the expression is to be generated
 * @return `Seq[org.apache.spark.sql.Column]` Spark SQL expressions
 */
def createRegexReplaceZeroes(origCols: Seq[String], replacedCols: Seq[String]) = {
  origCols.map { c =>
    if (replacedCols.contains(c)) regexp_replace(col(c), "^0*", "").as(c)
    else col(c)
  }
}
This function returns a Seq[org.apache.spark.sql.Column].
Now, store the columns to be replaced in an array:
val removeZeroes = Array( "subcategory", "subcategory_label" )
Then call the function with removeZeroes as an argument. It will return the regexp_replace expressions only for the columns listed in removeZeroes:
df.select( createRegexReplaceZeroes(df.columns, removeZeroes) :_* )
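Putting it together with the question's data, a minimal end-to-end sketch (assuming spark.implicits._ is imported so that toDF is available) looks like this:

// Sample rows mirroring the DataFrame from the question.
val df = Seq(
  ("00EEE",   "00EEE FFF",  "Drink"),
  ("0000EEE", "00EEE FFF",  "Fruit"),
  ("0EEE",    "000EEE FFF", "Meat")
).toDF("subcategory", "subcategory_label", "category")

val removeZeroes = Array("subcategory", "subcategory_label")

// Only subcategory and subcategory_label are rewritten; category is passed through as-is.
df.select(createRegexReplaceZeroes(df.columns, removeZeroes) :_*).show(false)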
Answer 1 (score: 0)
You can do the same thing using a UDF. I think it looks more elegant.
scala> val removeLeadingZerosUDF = udf({ x: String => x.replaceAll("^0*", "") })
removeLeadingZerosUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> val df = Seq( "000012340023", "000123400023", "001234000230", "012340002300", "123400002300" ).toDF("cols")
df: org.apache.spark.sql.DataFrame = [cols: string]
scala> df.show()
+------------+
| cols|
+------------+
|000012340023|
|000123400023|
|001234000230|
|012340002300|
|123400002300|
+------------+
scala> df.withColumn("newCols", removeLeadingZerosUDF($"cols")).show()
+------------+------------+
| cols| newCols|
+------------+------------+
|000012340023| 12340023|
|000123400023| 123400023|
|001234000230| 1234000230|
|012340002300| 12340002300|
|123400002300|123400002300|
+------------+------------+
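To address the original "all columns at once" requirement with this UDF, one possible sketch (assuming the question's DataFrame, where every column is a string) is to map the UDF over df.columns inside a single select:

import org.apache.spark.sql.functions.col  // may already be in scope in spark-shell

// Apply the UDF to every column, keeping the original column names.
df.select(df.columns.map(c => removeLeadingZerosUDF(col(c)).as(c)) :_*).show(false)

Note that the built-in regexp_replace used in the other answer is visible to the Catalyst optimizer, whereas a UDF is treated as a black box.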