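My requirement is to retrieve the order number from the comment column. The order number always starts with R. It should be added to the table as a new column.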
Input data:
code,id,mode,location,status,comment
AS-SD,101,Airways,hyderabad,D,order got delayed R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged
TY-OP,103,Airways,Pune,D,Order number R5463 not received
Expected output:
AS-SD,101,Airways,hyderabad,D,order got delayed R1657,R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged,R7856
TY-OP,103,Airways,Pune,D,Order number R5463 not received,R5463
I tried this in spark-sql. The query I am using is:
val r = sqlContext.sql("select substring(comment, PatIndex('%[0-9]%',comment, length(comment))) as number from A")
However, I get the following error:
org.apache.spark.sql.AnalysisException: undefined function PatIndex; line 0 pos 0
Answer 0 (score: 9)
You can use regexp_extract, which has the following definition:
def regexp_extract(e: Column, exp: String, groupIdx: Int): Column
Here (R\\d{4}) means R followed by 4 digits. You can cover other cases by using any valid regular expression:
df.withColumn("orderId", regexp_extract($"comment", "(R\\d{4})" , 1 )).show
+-----+---+-------+---------+------+--------------------+-------+
| code| id|   mode| location|status|             comment|orderId|
+-----+---+-------+---------+------+--------------------+-------+
|AS-SD|101|Airways|hyderabad|     D|order got delayed...|  R1657|
|FY-YT|102|Airways|    Delhi|    ND|R7856 package dam...|  R7856|
|TY-OP|103|Airways|     Pune|     D|Order number R546...|  R5463|
+-----+---+-------+---------+------+--------------------+-------+
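Since the question used a SQL query, the same function can also be called from SQL text. The following is only a sketch under some assumptions: it uses the Spark 2.x SparkSession API rather than the sqlContext from the question, the object name OrderIdSql and the helper withOrderId are made up for illustration, and the pattern is written as [0-9] instead of \\d to sidestep backslash escaping inside the SQL string literal.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object OrderIdSql {
  // Hypothetical helper: registers df as temp view "A" (the table name used in
  // the question) and adds an orderId column via the SQL form of regexp_extract.
  // Assumes df has a string column named "comment".
  def withOrderId(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("A")
    spark.sql(
      "select *, regexp_extract(comment, '(R[0-9]{4})', 1) as orderId from A")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("order-id-sql")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val df = Seq(
      ("AS-SD", 101, "Airways", "hyderabad", "D", "order got delayed R1657"),
      ("FY-YT", 102, "Airways", "Delhi", "ND", "R7856 package damaged"),
      ("TY-OP", 103, "Airways", "Pune", "D", "Order number R5463 not received")
    ).toDF("code", "id", "mode", "location", "status", "comment")

    withOrderId(spark, df).show(false)
    spark.stop()
  }
}
```

When a comment contains no match, regexp_extract returns an empty string rather than null, so rows without an order number may need a separate filter.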
Answer 1 (score: 3)
You can use a udf function, as shown below:
import org.apache.spark.sql.functions._
def extractString = udf((comment: String) => comment.split(" ").filter(_.startsWith("R")).head)
df.withColumn("newColumn", extractString($"comment")).show(false)
Here the comment column is split on spaces and filtered for words that start with R. head then takes the first word of the filtered result.
Update
To make sure the returned order number starts with R and the rest of the string is digits, you can add another filter:
import scala.util.Try
def extractString = udf((comment: String) => comment.split(" ").filter(x => x.startsWith("R") && Try(x.substring(1).toDouble).isSuccess).head)
You can modify the filter to suit your needs.
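One caveat with both udf versions: head throws a NoSuchElementException when no word passes the filter, and split fails on a null comment. The toDouble check also accepts words such as R1.5 or R1e3. A possible stricter, null-safe variant of the extraction logic is sketched below as a plain Scala function. The names OrderIdParser and firstOrderId are hypothetical, and the assumed format (R followed by digits only, as a whole word) goes beyond what the question states.

```scala
object OrderIdParser {
  // Assumption: an order number is a whole word made of "R" followed by
  // one or more digits, e.g. "R1657".
  private val OrderIdWord = "R\\d+"

  // Returns the first matching word, or None when the comment is null
  // or contains no order number.
  def firstOrderId(comment: String): Option[String] =
    Option(comment).flatMap(_.split("\\s+").find(_.matches(OrderIdWord)))
}
```

Wrapped as udf(OrderIdParser.firstOrderId _), this should give a nullable string column in which rows without an order number are null instead of failing the job.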