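Remove a regex pattern from a string column of a PySpark DataFrame

Time: 2018-01-11 13:25:42

Tags: python regex pyspark apache-spark-sql

I need to remove the text matching a regular expression from a column of strings in a PySpark DataFrame.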


The timestamp, e.g. 10H03, is the pattern that has to be removed.

    df = spark.createDataFrame(
        [("Dog 10H03", "10H03"),
         ("Cat 09H24 eats rat", "09H24"),
         ("Mouse 09H45 runs away", "09H45"),
         ("Mouse 09H45 enters room", "09H45")],
        ["Animal", "Time"])

    +--------------------+------------------+-----+
    |              Animal| Animal_strip_time| Time|
    +--------------------+------------------+-----+
    |           Dog 10H03|              Dog |10H03|
    |  Cat 09H24 eats rat|      Cat eats rat|09H24|
    |Mouse 09H45 runs ...|   Mouse runs away|09H45|
    |Mouse 09H45 enter...| Mouse enters room|09H45|
    +--------------------+------------------+-----+

The timestamp in the Animal column may differ from the timestamp in the Time column, so the Time column cannot be used to match the string.

The regular expression should follow the pattern XXHXX, where each X is a digit from 0 to 9.
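The XXHXX pattern can be tried out in plain Python before applying it to a DataFrame. The following is a minimal sketch using the standard `re` module; the names `TIME_PATTERN`, `samples`, and `stripped` are illustrative and not part of the question.

```python
import re

# XXHXX: two digits, a literal 'H', then two more digits
TIME_PATTERN = re.compile(r"\d\dH\d\d")

samples = ["Dog 10H03", "Cat 09H24 eats rat", "Mouse 09H45 runs away"]

# Replace every match with an empty string; the surrounding spaces are kept,
# so a timestamp in the middle of a string leaves two adjacent spaces behind
stripped = [TIME_PATTERN.sub("", s) for s in samples]
```

Note that removing only the timestamp leaves its neighbouring whitespace untouched, so "Cat 09H24 eats rat" becomes "Cat  eats rat" with a double space.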

1 answer:

Answer 0: (score: 3)

This should do the job:

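    from pyspark.sql import functions as F
    # Replace every XXHXX timestamp (two digits, 'H', two digits) with an empty string.
    # The raw string keeps the backslashes intact; withColumn returns a new DataFrame.
    df = df.withColumn('Animal_strip_time', F.regexp_replace('Animal', r'\d\dH\d\d', ''))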