How to extract a value after a specific string in Scala (Spark)?

Asked: 2018-10-11 06:23:32

Tags: regex scala apache-spark dataframe

I have a DataFrame with an itemType column:

df =

itemType                   count
it_shampoo                  5
it_books                    5
it_mm                       5
{it_mm}                     5
it_books it_books           5
{=it_books} it_books        5

I need to get:

itemType                   count
it_shampoo                  5
it_books                    5
it_mm                       5
it_mm                       5
it_books                    5
it_books                    5

How can I replace values such as "it_books it_books" and "{=it_books} it_books" with it_books? The item type always follows the it_ prefix.

2 Answers:

Answer 0: (score: 1)

Try matching itemType against the regex ^.*?(it_[\w]+).*$ and replacing the match with the first captured group, $1.
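The answer gives only the pattern, so here is a minimal sketch of how it could be applied in Spark with regexp_replace. The object name NormalizeItemType, the helper normalize, and the local SparkSession settings are assumptions made for illustration; only the pattern, the replacement $1, and the sample data come from the post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}

// Hypothetical wrapper object; the name is not from the original post.
object NormalizeItemType {

  // Replace the whole value with the first token that starts with "it_".
  // Values that do not contain "it_" are left unchanged by regexp_replace.
  def normalize(df: DataFrame): DataFrame =
    df.withColumn(
      "itemType",
      regexp_replace(col("itemType"), """^.*?(it_[\w]+).*$""", "$1"))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder()
      .appName("normalize-item-type")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data copied from the question.
    val df = Seq(
      ("it_shampoo", 5),
      ("it_books", 5),
      ("it_mm", 5),
      ("{it_mm}", 5),
      ("it_books it_books", 5),
      ("{=it_books} it_books", 5)
    ).toDF("itemType", "count")

    normalize(df).show(false)
    spark.stop()
  }
}
```

Because the lazy prefix .*? stops at the first it_ token, values with repeated or wrapped item types should collapse to a single it_ name.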


Answer 1: (score: 0)

The regex below also works:

scala> val df = Seq(("it_shampoo",5),
     | ("it_books",5),
     | ("it_mm",5),
     | ("{it_mm}",5),
     | ("it_books it_books",5),
     | ("{=it_books} it_books",5)).toDF("itemType","count")
df: org.apache.spark.sql.DataFrame = [itemType: string, count: int]

scala> df.select( regexp_replace('itemtype,""".*\b(\S+)\b(.*)$""", "$1").as("replaced"),'count).show
+----------+-----+
|  replaced|count|
+----------+-----+
|it_shampoo|    5|
|  it_books|    5|
|     it_mm|    5|
|     it_mm|    5|
|  it_books|    5|
|  it_books|    5|
+----------+-----+

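The two patterns agree on the sample data but are not equivalent, which a quick check outside Spark can show. The sketch below uses plain String.replaceAll, which relies on the same Java regex syntax; the object name ItemTypePatterns, the helper normalize, and the extra input "it_books extra" are hypothetical and not part of the question.

```scala
// Hypothetical helper object for comparing the two answers' patterns.
object ItemTypePatterns {
  // Answer 0: keeps the first token that starts with the literal prefix "it_".
  val prefixPattern: String = """^.*?(it_[\w]+).*$"""

  // Answer 1: keeps the last word-bounded run of non-space characters,
  // whether or not it starts with "it_".
  val lastTokenPattern: String = """.*\b(\S+)\b(.*)$"""

  // Replace the matched text with capture group 1.
  def normalize(value: String, pattern: String): String =
    value.replaceAll(pattern, "$1")

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "it_shampoo", "it_books", "it_mm", "{it_mm}",
      "it_books it_books", "{=it_books} it_books",
      "it_books extra" // made-up input, not in the question
    )
    samples.foreach { s =>
      val byPrefix = normalize(s, prefixPattern)
      val byLastToken = normalize(s, lastTokenPattern)
      println(s"$s -> prefix: $byPrefix, last token: $byLastToken")
    }
  }
}
```

For the six values in the question both patterns give the same result. For the made-up value "it_books extra", the prefix pattern yields it_books while the last-token pattern yields extra, so the pattern anchored on it_ is probably the safer choice if other words can follow the item type.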