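I have a table:
id  itemNames                                          coupons
1   item (foo bar) is available, soaps                 true
2   item (bar) is available                            false
3   soaps, shampoo                                     false
4   item (foo bar, bar) is available                   true
5   item (foo bar, bar) is available, (soap, shampoo)  true
6   null                                               false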
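I want to explode it into:
id  itemNames                         coupons
1   item (foo bar) is available       true
1   soaps                             true
2   item (bar) is available           false
3   soaps                             false
3   shampoo                           false
4   item (foo bar, bar) is available  true
5   item (foo bar, bar) is available  true
5   (soap, shampoo)                   true
6   null                              false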
When I do this:
df.withColumn("itemNames", explode(split($"itemNames", "[,]")))
I get:
itemNames coupons
item (foo bar) is available true
soaps true
item (bar) is available false
soaps false
shampoo false
item (foo bar, true
bar) is available true
(soap, true
shampoo) true
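Can someone tell me what I am doing wrong and how to correct it? One common pattern here is that the commas I do not want to split on appear inside ().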
Answer 0 (score: 1)
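Your data has no general pattern that tells you where to split the string. Below is a workaround that works for this specific case: I use a lookbehind so that the split only happens on a comma that directly follows "available". Try it in the DataFrame explode.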
scala> "item (foo bar) is available, soaps".split("(?<=available),")
res41: Array[String] = Array(item (foo bar) is available, " soaps")
scala> "item (foo bar) is available, soaps".split("(?<=available),").length
res42: Int = 2
scala> "item (foo bar, bar) is available".split("(?<=available),")
res44: Array[String] = Array(item (foo bar, bar) is available)
scala> "item (foo bar, bar) is available".split("(?<=available),").length
res45: Int = 1
EDIT1
scala> "item (foo bar, bar) is empty, (soap, shampoo)".split("(?<=available|empty),").length
res1: Int = 2
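As a quick sanity check (a sketch, not part of the original answer), the lookbehind pattern can be run against the itemNames values from the question in plain Scala, without Spark. The names `LookbehindCheck` and `parts` are made up for illustration. One thing this shows: "soaps, shampoo" is left as a single element, because its comma is not preceded by "available" or "empty", so the workaround only covers rows that follow that wording.

```scala
object LookbehindCheck {
  // Pattern from the answer above: a comma that directly follows
  // "available" or "empty".
  val pattern = "(?<=available|empty),"

  // Hypothetical helper: split one itemNames value with that pattern.
  def parts(s: String): List[String] = s.split(pattern).toList

  def main(args: Array[String]): Unit = {
    // Non-null itemNames values taken from the question's table.
    val rows = List(
      "item (foo bar) is available, soaps",
      "item (bar) is available",
      "soaps, shampoo",
      "item (foo bar, bar) is available",
      "item (foo bar, bar) is available, (soap, shampoo)"
    )
    // Print how many pieces each row is split into.
    rows.foreach(r => println(s"${parts(r).size} piece(s): $r"))
  }
}
```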
Answer 1 (score: 1)
Using a UDF, inspired by "Regex to match only commas not in parentheses?":
import java.util.regex.Pattern
import org.apache.spark.sql.functions.{explode, udf}
// In spark-shell, `spark` is predefined and these implicits are already in scope.
import spark.implicits._

val df = List(
  ("item (foo bar) is available, soaps", true),
  ("item (bar) is available", false),
  ("soaps, shampoo", false),
  ("item (foo bar, bar) is available", true),
  ("item (foo bar, bar) is available, (soap, shampoo)", true)
).toDF("itemNames", "coupons")
df.show(false)

// Verbose regex: with Pattern.COMMENTS, whitespace and #-comments in the pattern are ignored.
val regex = Pattern.compile(
  ", # Match a comma\n" +
  "(?! # only if it's not followed by...\n" +
  " [^(]* # any number of characters except opening parens\n" +
  " \\) # followed by a closing parens\n" +
  ") # End of lookahead",
  Pattern.COMMENTS)

// Split on commas that are outside parentheses.
val customSplit = (value: String) => regex.split(value)
val customSplitUDF = udf(customSplit)
val result = df.withColumn("itemNames", explode(customSplitUDF($"itemNames")))
result.show(false)
The output is:
+--------------------------------+-------+
|itemNames                       |coupons|
+--------------------------------+-------+
|item (foo bar) is available     |true   |
| soaps                          |true   |
|item (bar) is available         |false  |
|soaps                           |false  |
| shampoo                        |false  |
|item (foo bar, bar) is available|true   |
|item (foo bar, bar) is available|true   |
| (soap, shampoo)                |true   |
+--------------------------------+-------+
If trimming is needed, it can easily be added to customSplit.
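A minimal sketch of what that trimmed variant could look like, written as plain Scala so it can be tried without Spark. The names `SplitOutsideParens` and `splitAndTrim` are hypothetical, and the null handling is an assumption added here because the question's table has a null itemNames row (calling split on null would otherwise throw). Like the regex above, it assumes parentheses are not nested.

```scala
import java.util.regex.Pattern

object SplitOutsideParens {
  // Same idea as the verbose regex above, in compact form:
  // a comma that is not followed by "...)" without an intervening "(".
  private val commaOutsideParens: Pattern = Pattern.compile(",(?![^(]*\\))")

  // Hypothetical helper: split on commas outside parentheses and trim each piece.
  // A null input yields an empty list instead of a NullPointerException (assumption).
  def splitAndTrim(value: String): List[String] =
    if (value == null) Nil
    else commaOutsideParens.split(value).map(_.trim).toList

  def main(args: Array[String]): Unit = {
    // Print each piece on its own line for two of the question's rows.
    splitAndTrim("item (foo bar) is available, soaps").foreach(println)
    splitAndTrim("item (foo bar, bar) is available, (soap, shampoo)").foreach(println)
  }
}
```

In Spark this function would presumably be wrapped with udf in the same way as customSplit. Note that explode drops rows whose array is null or empty, so keeping the null row from the question would likely mean using explode_outer (available since Spark 2.2) instead.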