I have a DataFrame with the following columns:
df =
itemType count
it_shampoo 5
it_books 5
it_mm 5
{it_mm} 5
it_books it_books 5
{=it_books} it_books 5
I need to get:
itemType count
it_shampoo 5
it_books 5
it_mm 5
it_mm 5
it_books 5
it_books 5
How can I replace values such as it_books it_books and {=it_books} it_books with it_books? The item type always starts with it_.
Answer 0 (score: 1)
Try applying the regular expression ^.*?(it_[\w]+).*$ to itemType and replacing each match with the first captured group, $1.
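As a quick sanity check, the pattern from this answer can be tried in plain Scala, without Spark, against the sample values from the question. This is only a sketch: the names ItemTypeCleanup and clean are made up for illustration, and [\w]+ is written as the equivalent \w+. In Spark the same pattern and replacement string would presumably be passed to regexp_replace from org.apache.spark.sql.functions.

```scala
// Hypothetical helper object, not part of the original answer.
object ItemTypeCleanup {
  // Lazily skip any prefix, capture "it_" followed by word characters,
  // then consume whatever remains so the whole string is replaced.
  val pattern: String = """^.*?(it_\w+).*$"""

  // Replace the full match with the first captured group.
  def clean(itemType: String): String = itemType.replaceAll(pattern, "$1")

  // Sample values taken from the question.
  val samples: Seq[String] = Seq(
    "it_shampoo",
    "it_books",
    "it_mm",
    "{it_mm}",
    "it_books it_books",
    "{=it_books} it_books"
  )

  val cleaned: Seq[String] = samples.map(clean)
}
```

Because .*? is lazy, the first it_ token in the string is the one that gets captured, so a value like it_books it_books collapses to a single it_books.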
Answer 1 (score: 0)
The regular expression below also works:
scala> val df = Seq(("it_shampoo",5),
| ("it_books",5),
| ("it_mm",5),
| ("{it_mm}",5),
| ("it_books it_books",5),
| ("{=it_books} it_books",5)).toDF("itemType","count")
df: org.apache.spark.sql.DataFrame = [itemType: string, count: int]
scala> df.select( regexp_replace('itemType,""".*\b(\S+)\b(.*)$""", "$1").as("replaced"),'count).show
+----------+-----+
|  replaced|count|
+----------+-----+
|it_shampoo|    5|
|  it_books|    5|
|     it_mm|    5|
|     it_mm|    5|
|  it_books|    5|
|  it_books|    5|
+----------+-----+
scala>