How to explode with a delimiter in Spark

Date: 2018-10-05 05:03:05

Tags: scala apache-spark explode

I have a table:

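id  itemNames                                            coupons
1   item (foo bar) is available, soaps                   true
2   item (bar) is available                              false
3   soaps, shampoo                                       false
4   item (foo bar, bar) is available                     true
5   item (foo bar, bar) is available, (soap, shampoo)    true
6   null                                                 false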

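I want to explode it into:

id  itemNames                           coupons
1   item (foo bar) is available         true
1   soaps                               true
2   item (bar) is available             false
3   soaps                               false
3   shampoo                             false
4   item (foo bar, bar) is available    true
5   item (foo bar, bar) is available    true
5   (soap, shampoo)                     true
6   null                                false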

When I do this:

 df.withColumn("itemNames", explode(split($"itemNames", "[,]")))

I get:

itemNames                                          coupons
item (foo bar) is available                        true       
soaps                                              true 
item (bar) is available                            false
soaps                                              false
shampoo                                            false
item (foo bar,                                     true
bar) is available                                  true 
(soap,                                             true    
shampoo)                                           true

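Can someone tell me what I am doing wrong and how to fix it? One common pattern here is that the unwanted commas appear inside ().

2 Answers:

Answer 0: (score: 1)

Your data has no general pattern that a lookbehind could split on. The following is a workaround that works for this specific case: I use a lookbehind so that the split happens only on a comma that directly follows "available". Try it in your DataFrame explode.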

scala> "item (foo bar) is available, soaps".split("(?<=available),")
res41: Array[String] = Array(item (foo bar) is available, " soaps")

scala> "item (foo bar) is available, soaps".split("(?<=available),").length
res42: Int = 2

scala> "item (foo bar, bar) is available".split("(?<=available),")
res44: Array[String] = Array(item (foo bar, bar) is available)

scala> "item (foo bar, bar) is available".split("(?<=available),").length
res45: Int = 1

EDIT1

scala> "item (foo bar, bar) is empty, (soap, shampoo)".split("(?<=available|empty),").length
res1: Int = 2
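
To show how the lookbehind could be used inside the DataFrame explode, here is a minimal sketch. It assumes Spark 2.2 or later (for explode_outer) and a local SparkSession; the object name LookbehindExplode, the run helper, and the three sample rows are illustrative and not part of the original answer. It relies on the built-in split function accepting a Java regular expression.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode_outer, split, trim}

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object LookbehindExplode {
  // Split only on a comma that directly follows "available" (the lookbehind from the answer).
  val delimiter = "(?<=available),"

  def run(spark: SparkSession): Array[(Int, String)] = {
    import spark.implicits._

    // A few rows from the question; the id column is kept so the result is easy to read.
    val df = Seq(
      (1, "item (foo bar) is available, soaps"),
      (4, "item (foo bar, bar) is available"),
      (6, null.asInstanceOf[String])
    ).toDF("id", "itemNames")

    df
      // explode_outer keeps the row whose itemNames is null; plain explode would drop it.
      .withColumn("itemNames", explode_outer(split(col("itemNames"), delimiter)))
      // Remove the leading space left behind after each comma.
      .withColumn("itemNames", trim(col("itemNames")))
      .as[(Int, String)]
      .collect()
  }
}
```

Because the pattern only fires after the word "available", a row such as "soaps, shampoo" would stay unsplit, which is why this remains a workaround for this particular data rather than a general solution.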

Answer 1: (score: 1)

Using a UDF, inspired by "Regex to match only commas not in parentheses?":

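import java.util.regex.Pattern
import org.apache.spark.sql.functions.{explode, udf}
// Outside spark-shell, also import spark.implicits._ for toDF and the $"..." syntax.

val df = List(
  ("item (foo bar) is available, soaps", true),
  ("item (bar) is available", false),
  ("soaps, shampoo", false),
  ("item (foo bar, bar) is available", true),
  ("item (foo bar, bar) is available, (soap, shampoo)", true)
).toDF("itemNames", "coupons")
df.show(false)

// Matches only commas that are not inside parentheses.
// Pattern.COMMENTS allows whitespace and # comments inside the regex.
val regex = Pattern.compile(
  ",         # Match a comma\n" +
    "(?!       # only if it's not followed by...\n" +
    " [^(]*    #   any number of characters except opening parens\n" +
    " \\)      #   followed by a closing parens\n" +
    ")         # End of lookahead",
  Pattern.COMMENTS)

// Wrap the regex split in a UDF so it can be applied to a column.
val customSplit = (value: String) => regex.split(value)
val customSplitUDF = udf(customSplit)
val result = df.withColumn("itemNames", explode(customSplitUDF($"itemNames")))
result.show(false)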

The output is:

+--------------------------------+-------+
|itemNames                       |coupons|
+--------------------------------+-------+
|item (foo bar) is available     |true   |
| soaps                          |true   |
|item (bar) is available         |false  |
|soaps                           |false  |
| shampoo                        |false  |
|item (foo bar, bar) is available|true   |
|item (foo bar, bar) is available|true   |
| (soap, shampoo)                |true   |
+--------------------------------+-------+

If trimming is needed, it can easily be added to "customSplit".
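
As an illustration of the trimming mentioned above, the following is a minimal sketch in plain Scala. The names TrimmedSplit and splitAndTrim are hypothetical, and the null handling is an added assumption (the question's table has a null row, while the answer's sample data does not). The regex is the same comma-outside-parentheses pattern, written without comments mode.

```scala
import java.util.regex.Pattern

// Hypothetical helper object; not part of the original answer.
object TrimmedSplit {
  // A comma that is not followed by a ")" before the next "(" is treated as outside parentheses.
  private val commaOutsideParens: Pattern = Pattern.compile(",(?![^(]*\\))")

  // Splits on top-level commas and trims each piece.
  // Assumption: a null input yields an empty sequence instead of throwing.
  def splitAndTrim(value: String): Seq[String] =
    if (value == null) Seq.empty
    else commaOutsideParens.split(value).toSeq.map(_.trim)
}
```

A function like this could be passed to udf in place of customSplit. Note that explode drops rows whose array is empty, so rows with a null itemNames would disappear from the result unless something like explode_outer (Spark 2.2 or later) is used instead.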