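Replace a string with null if it contains a digit - Spark

Time: 2017-06-12 22:04:46

Tags: scala apache-spark spark-dataframe

I am new to Spark with Scala and am trying to clean up some data. I am having trouble with the FIRSTNAME and LASTNAME columns: some of the strings contain digits. How can I detect the digits and replace the entire string with null?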

Consider the following dataframe:

+---------+--------+
|FIRSTNAME|LASTNAME|
+---------+--------+
|    Steve|    10 C|
|     Mark|    9436|
|    Brian|    Lara|
+---------+--------+

How do I get this:

+---------+--------+
|FIRSTNAME|LASTNAME|
+---------+--------+
|    Steve|    null|
|     Mark|    null|
|    Brian|    Lara|
+---------+--------+

Any help is much appreciated. Thank you!

Edit:

scala> df2.withColumn("LASTNAME_TEMP", when(col("LASTNAME").contains("1"), null).otherwise(col("LASTNAME"))).show()
+---------+--------+-------------+
|FIRSTNAME|LASTNAME|LASTNAME_TEMP|
+---------+--------+-------------+
|    Steve|    10 C|         null|
|     Mark|    9436|         9436|
|    Brian|    Lara|         Lara|
+---------+--------+-------------+

However, the code above only accepts a single string. I would prefer it to take a list of strings. For example:

val numList = List("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")

I declared the list above and ran the following code:

scala> df2.filter(col("LASTNAME").isin(numList:_*)).show()

I got the following empty dataframe:

+---------+--------+
|FIRSTNAME|LASTNAME|
+---------+--------+
+---------+--------+

1 Answer:

Answer 0 (score: 3)

You can use a regular expression with rlike for pattern matching:

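// In spark-shell these are already in scope; a standalone application needs them,
// where `spark` is the active SparkSession.
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._

val df = Seq(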
  ("Steve", "10 C"),
  ("Mark", "9436"),
  ("Brian", "Lara")
).toDF(
  "FIRSTNAME", "LASTNAME"
)

// Keep the original LASTNAME in the new column only if it contains no digit.
// when(...) without otherwise(...) yields null for rows that fail the condition.
val df2 = df.withColumn( "LASTNAMEFIXED", when( ! col("LASTNAME").rlike(".*[0-9]+.*"), col("LASTNAME") ) )

df2.show()

+---------+--------+-------------+
|FIRSTNAME|LASTNAME|LASTNAMEFIXED|
+---------+--------+-------------+
|    Steve|    10 C|         null|
|     Mark|    9436|         null|
|    Brian|    Lara|         Lara|
+---------+--------+-------------+
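
The question asks for the names to be replaced in place and for a list of digits to be accepted, so here is a minimal sketch of both, building on the rlike approach above. The object name `NullifyNamesWithDigits` and the helpers `nullIfHasDigit` and `nullifyDigits` are hypothetical names introduced for illustration, and the local SparkSession settings are assumptions. Since rlike matches when the pattern is found anywhere in the string, the shorter pattern "[0-9]" should behave the same as ".*[0-9]+.*". The list-based variant uses contains rather than isin, because isin tests whether the whole column value equals one of the list elements, which is why the filter in the question returned no rows.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object NullifyNamesWithDigits {
  // Hypothetical helper: keep the value unless it contains a digit.
  // when(...) without otherwise(...) yields null for rows that fail the condition.
  def nullIfHasDigit(c: Column): Column = when(!c.rlike("[0-9]"), c)

  // Hypothetical helper: overwrite each named column in place.
  // withColumn replaces a column when the given name already exists.
  def nullifyDigits(df: DataFrame, names: Seq[String]): DataFrame =
    names.foldLeft(df)((acc, name) => acc.withColumn(name, nullIfHasDigit(col(name))))

  def main(args: Array[String]): Unit = {
    // Assumed local session for the sketch; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("nullify-digits").getOrCreate()
    import spark.implicits._

    val df = Seq(("Steve", "10 C"), ("Mark", "9436"), ("Brian", "Lara"))
      .toDF("FIRSTNAME", "LASTNAME")

    // Regex variant applied to both name columns.
    nullifyDigits(df, Seq("FIRSTNAME", "LASTNAME")).show()

    // List-based variant: OR together one substring test per list element.
    val numList = List("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")
    val hasAnyDigit = numList.map(d => col("LASTNAME").contains(d)).reduce(_ || _)
    df.withColumn("LASTNAME", when(!hasAnyDigit, col("LASTNAME"))).show()

    spark.stop()
  }
}
```

Both variants should produce the table requested in the question, with LASTNAME set to null for Steve and Mark. The regex form is probably the better default, since it is a single expression per column, while the list form is useful if the set of forbidden substrings is not limited to digits.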