I need to remove a regular expression from a column of strings in a pyspark dataframe
The timestamp, e.g. 10H03, is the regular expression that must be removed.
df = spark.createDataFrame([("Dog 10H03", "10H03"), ("Cat 09H24 eats rat", "09H24"), ("Mouse 09H45 runs away", "09H45"), ("Mouse 09H45 enters room", "09H45")], ["Animal", "Time"])
Desired output:
+--------------------+------------------+-----+
|              Animal| Animal_strip_time| Time|
+--------------------+------------------+-----+
|           Dog 10H03|              Dog |10H03|
|  Cat 09H24 eats rat|     Cat  eats rat|09H24|
|Mouse 09H45 runs ...|  Mouse  runs away|09H45|
|Mouse 09H45 enter...|Mouse  enters room|09H45|
+--------------------+------------------+-----+
The timestamp in column Animal may differ from the timestamp in column Time, so the Time column cannot be used to match the string.
The regular expression should follow the pattern XXHXX, where X is a digit from 0 to 9.
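The XXHXX shape can be tried out locally with Python's `re` module before running anything in Spark. This is only a minimal sketch: `TIME_PATTERN` and `strip_time` are hypothetical names, and it assumes that Spark's Java regex engine treats this simple pattern the same way Python does for ASCII digits.

```python
import re

# Two digits, a literal "H", then two digits: the XXHXX shape described above.
TIME_PATTERN = re.compile(r"[0-9]{2}H[0-9]{2}")


def strip_time(text: str) -> str:
    """Remove every XXHXX timestamp from text; all other characters are kept."""
    return TIME_PATTERN.sub("", text)


for sample in ["Dog 10H03", "Cat 09H24 eats rat", "Mouse 09H45 runs away"]:
    # repr() makes leftover spaces visible in the printed result
    print(repr(strip_time(sample)))
```

Only the timestamp itself is removed, so the spaces that surrounded it remain in the string.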
Answer 0 (score: 3)
This should do the job:
from pyspark.sql import functions as F
# Raw string so the backslashes reach the regex engine unchanged
df.withColumn('Animal_strip_time', F.regexp_replace('Animal', r'\d\dH\d\d', ''))
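Removing the timestamp leaves the surrounding spaces behind, for example a trailing space after `Dog` or a double space in the middle of a sentence. If that is unwanted, one option is to collapse the whitespace afterwards. The plain-Python sketch below illustrates the idea; `strip_time_and_tidy` is a hypothetical helper, not part of the answer. In Spark, the same effect should be achievable by wrapping the call in a second `F.regexp_replace` with the pattern `r'\s+'` and a single-space replacement, followed by `F.trim`.

```python
import re


def strip_time_and_tidy(text: str) -> str:
    """Drop XXHXX timestamps, then collapse whitespace runs and trim the ends."""
    without_time = re.sub(r"\d\dH\d\d", "", text)
    # Collapse any run of whitespace left behind into a single space
    return re.sub(r"\s+", " ", without_time).strip()


print(strip_time_and_tidy("Cat 09H24 eats rat"))
print(strip_time_and_tidy("Dog 10H03"))
```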