Substring matching within a string

Time: 2019-02-13 13:47:26

Tags: python pyspark apache-spark-sql

I am trying to find and replace values inside a string column by using another column.

I have two columns, labels and choice.

Table 1
id = 12
labels = case1|case2|case3

Table 2
id = 12
label&values = case1coke.case2fanta:case3cheez

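The example above is in English, but the real label&values and labels columns are in Japanese. I tried regexp_replace, but because of the data volume and the many special characters in the data, regexp_replace does not work for me. I am looking for a way to solve this with plain string matching instead.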
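One possible way to do the split without any regular expression is sketched below in plain Python. It assumes the labels are separated by `|` as in Table 1, that each label occurs at most once in the label&values string, and that the characters between a value and the next label are a small known set of separators. The function name `split_by_labels` and the `separators` default (including the full-width colon) are hypothetical and chosen only for this illustration.

```python
def split_by_labels(labels, text, separators=".:： "):
    """Return (label, value) pairs found in text using literal substring search."""
    hits = []
    for label in labels.split("|"):
        pos = text.find(label)  # literal match, so regex metacharacters are harmless
        if label and pos != -1:
            hits.append((pos, label))
    hits.sort()  # order the labels by where they appear in the text

    pairs = []
    for i, (pos, label) in enumerate(hits):
        start = pos + len(label)
        # The value runs up to the next label, or to the end of the string
        end = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        pairs.append((label, text[start:end].strip(separators)))
    return pairs


rows = split_by_labels("case1|case2|case3", "case1coke.case2fanta:case3cheez")
for label, value in rows:
    print(12, label, value)
```

In a PySpark job a function like this could be wrapped with `pyspark.sql.functions.udf` and applied to the joined columns, at the cost of Python UDF overhead. The sketch does not handle a label that is a prefix of another label (for example case1 and case10), which would need extra care.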

The expected output is:

id  label   value
12  case1   coke
12  case2   fanta
12  case3   juice

from pyspark.sql import functions as f

# A triple-quoted string is used so that the SQL "--" comment ends at its own line
# instead of swallowing the rest of the query
df = sqlContext.sql("""
select
    a.shop_id
    ,a.item_id
    ,regexp_replace
        (
            regexp_replace
                (
                a.choice
                ,concat('(^|(?<![::]))(', concatlables, ')') -- this is not working for all of the Japanese records in this case
                ,'⚙$2⚛'
                )
        ,'⚛[::]' ,'⚛'
        ) as choice
from
    rdsp_production_production_ex_odin_mall.basket_main2 a
inner join brandmart.control_labels_concatlabels b
    on a.shop_id = b.shop_id
    and a.item_id = b.item_id
where a.reg_date > '2019-02-07'
""")

r = df.select("shop_id", "item_id", f.split("choice", "⚙").alias("final"), f.posexplode(f.split("choice", "⚙")).alias("pos", "val"))
# After the select, each exploded piece lives in "val", so that is the column to split
split_col = f.split(r['val'], '⚛')
r = r.withColumn('NAME1', split_col.getItem(0))
r = r.withColumn('NAME2', split_col.getItem(1))
  

Error:

[Stage 23:> (6 + 2) / 2977] 19/02/14 11:45:58 WARN scheduler.TaskSetManager: Lost task 972.0 in stage 23.0 (TID 54, bhdp4411.prod.hnd1.bdd.local, executor 2): org.apache.spark.SparkException:
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(…)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.regex.PatternSyntaxException: near index 79 (^|(?
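The trace is cut off, so the exact cause is not visible, but a PatternSyntaxException raised on a pattern built from data often means that some labels contain regex metacharacters such as ( or ?. As a hedged illustration of that assumption, the sketch below uses Python's `re` module to show the same kind of failure and how escaping each label avoids it. The helper name `build_label_pattern` and the sample labels are hypothetical.

```python
import re


def build_label_pattern(labels):
    """Build an alternation in which every label is matched literally."""
    # Longest labels first, so a label that is a prefix of another does not win
    parts = sorted((p for p in labels.split("|") if p), key=len, reverse=True)
    return re.compile("|".join(re.escape(p) for p in parts))


raw_labels = "size(L|taste?"  # made-up labels containing metacharacters

# Using the labels as a pattern directly fails, similar in spirit to the Java error
try:
    re.compile(raw_labels)
    unescaped_failed = False
except re.error:
    unescaped_failed = True

pattern = build_label_pattern(raw_labels)
print(unescaped_failed, pattern.findall("size(Lcoke.taste?lemon"))
```

On the Java side, which Spark's regexp_replace relies on, the comparable tool is literal quoting with `\Q...\E` (what `java.util.regex.Pattern.quote` produces). Whether that can be applied conveniently to the concatenated labels column here is not something the question gives enough detail to confirm.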

0 answers:

No answers