Question

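I need to split a column value on '|' and put all items except the first one into a new column "address". To make it harder, the number of items is not always the same!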

df1 = spark.createDataFrame([
  ["Luc  Krier|2363  Ryan Road"],
  ["Jeanny  Thorn|2263 Patton Lane|Raleigh North Carolina"],
  ["Teddy E Beecher|2839 Hartland Avenue|Fond Du Lac Wisconsin|US"],
  ["Philippe  Schauss|1 Im Oberdor|Allemagne"],
  ["Meindert I Tholen|Hagedoornweg 138|Amsterdam|NL"]
]).toDF("s")

What I have tried:

split, size, and substring, but I could not get it to work. Any help is appreciated!

Expected output:

address
------------------------------------------------------------
2363  Ryan Road
2263 Patton Lane|Raleigh North Carolina
2839 Hartland Avenue|Fond Du Lac Wisconsin|US
1 Im Oberdor|Allemagne
Hagedoornweg 138|Amsterdam|NL

Answer 1

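Try this:

from pyspark.sql.functions import concat_ws, slice, split  # PySpark column functions
df1.select(concat_ws('|', slice(split('s', r'\|'), 2, 1000))).show(truncate=False)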

+------------------------------------------+
|concat_ws(|, slice(split(s, \|), 2, 1000))|
+------------------------------------------+
|2363  Ryan Road|Long Lake South Dakota    |
|2263 Patton Lane|Raleigh North Carolina   |
|2839 Hartland Avenue|Fond Du Lac Wisconsin|
|1 Im Oberdor|Allemagne                    |
|Hagedoornweg 138|Amsterdam                |
+------------------------------------------+

Here 1000 is the maximum number of items to take from the array; it is just an arbitrarily large integer.
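To make the three steps easier to follow, here is a rough plain-Python sketch of what split, slice, and concat_ws do to a single row value. It is not Spark code, and `drop_first_item` is a hypothetical helper name used only for illustration.

```python
# Plain-Python illustration of split -> slice -> concat_ws on one string.
# drop_first_item is a made-up name, not part of any Spark API.
def drop_first_item(value: str, sep: str = "|") -> str:
    items = value.split(sep)   # like split(s, '\|'): break the string into items
    rest = items[1:]           # like slice(..., 2, 1000): keep the 2nd item onward
    return sep.join(rest)      # like concat_ws('|', ...): glue the items back together


rows = [
    "Luc  Krier|2363  Ryan Road",
    "Teddy E Beecher|2839 Hartland Avenue|Fond Du Lac Wisconsin|US",
]
for row in rows:
    print(drop_first_item(row))
```

Because the list slice has no upper bound, the sketch works for any number of items. In this sketch, a value with no '|' at all yields an empty string.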

Answer 2

The function 'instr' can be used to find the first '|', and 'substring' can then be used to get the rest of the string:

df1.selectExpr(
  "substring(s, instr(s,'|') + 1, length(s))"
)
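The index arithmetic in that expression can be checked on an ordinary string. The following is a minimal plain-Python sketch, not Spark code, and `after_first_sep` is a hypothetical name chosen for illustration.

```python
# Plain-Python illustration of the instr + substring idea.
# after_first_sep is a made-up name, not part of any Spark API.
def after_first_sep(value: str, sep: str = "|") -> str:
    # str.find is 0-based and returns -1 when sep is absent, so adding 1
    # gives the 1-based position that instr reports (0 when not found).
    pos = value.find(sep) + 1
    # Slicing from pos skips everything up to and including the first sep.
    return value[pos:]


print(after_first_sep("Jeanny  Thorn|2263 Patton Lane|Raleigh North Carolina"))
```

Unlike the split-based approach, this sketch returns the whole string unchanged when there is no '|' in it.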

Or use a regular expression that removes everything from the start of the string up to and including the first '|':

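import org.apache.spark.sql.functions.regexp_replace
import spark.implicits._  // Scala: enables the $"s" column syntax

df1.select(
  regexp_replace($"s", "^[^\\|]+\\|", "")
)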
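The pattern itself can be tried outside Spark. Below is a small sketch using Python's standard `re` module with an equivalent pattern; `strip_first_item` and `FIRST_ITEM` are hypothetical names used only for this illustration.

```python
import re

# Equivalent of "^[^\\|]+\\|": one or more non-pipe characters at the start
# of the string, followed by the first pipe.
FIRST_ITEM = re.compile(r"^[^|]+\|")


# strip_first_item is a made-up name, not part of any Spark API.
def strip_first_item(value: str) -> str:
    # Replace the leading "first item + pipe" with nothing.
    return FIRST_ITEM.sub("", value, count=1)


print(strip_first_item("Philippe  Schauss|1 Im Oberdor|Allemagne"))
```

Since `[^|]+` requires at least one character, a value that starts with '|' or contains no '|' is left unchanged by this sketch.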

Split a DataFrame column value on '|' and get all items except the first