I have a column of paths that I need to split into two columns, basename and dirname. I know how to easily get the basename of a path:

import org.apache.spark.sql.functions.substring_index

val df = Seq("/test/coucou/jambon/hello/file",
             "/test/jambon/test")
  .toDF("column1")

df.withColumn("basename", substring_index($"column1", "/", -1))
  .show(2, false)
+------------------------------+---------+
|column1 |basename |
+------------------------------+---------+
|/test/coucou/jambon/hello/file|file |
|/test/jambon/test |test |
+------------------------------+---------+
But I am struggling to get the dirname, like this:
+------------------------------+--------------------------+
|column1 |dirname |
+------------------------------+--------------------------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello |
|/test/jambon/test |/test/jambon |
+------------------------------+--------------------------+
I have tried various solutions, but I cannot find a working columnar solution.
My best idea would be to subtract $"basename" from $"column1", but I could not find a way to subtract Strings in Spark.
Answer 0 (score: 3)
An alternative to the already-provided solutions, using regular expressions.
With the regex set up correctly, the regexp_extract function will give you what you need.
val df = Seq("/test/coucou/jambon/hello/file",
             "/test/jambon/prout/test")
  .toDF("column1")

import org.apache.spark.sql.functions.regexp_extract

df.withColumn("path", regexp_extract('column1, "^\\/(\\w+\\/)+", 0))
  .withColumn("fileName", regexp_extract('column1, "\\w+$", 0))
  .show(false)
Output:
+------------------------------+--------------------------+--------+
|column1 |path |fileName|
+------------------------------+--------------------------+--------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello/|file |
|/test/jambon/prout/test |/test/jambon/prout/ |test |
+------------------------------+--------------------------+--------+
Edit:
A regex that is easier to manage and does not keep the trailing slash:

df.withColumn("path", regexp_extract($"column1", "^(.+)(/.+)$", 1))
Answer 1 (score: 2)
You can use expr to take a substring of column1. The code should look like the following. Hope it helps.
// Creating test data
val df = Seq("/test/coucou/jambon/hello/file",
             "/test/jambon/prout/test")
  .toDF("column1")

val test = df.withColumn("basename", substring_index($"column1", "/", -1))
  .withColumn("path", expr("substring(column1, 1, length(column1) - length(basename) - 1)"))
test.show(false)
+------------------------------+--------+-------------------------+
|column1 |basename|path |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file |/test/coucou/jambon/hello|
|/test/jambon/prout/test |test |/test/jambon/prout |
+------------------------------+--------+-------------------------+
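
If the intermediate basename column is not needed, the same length arithmetic can be folded into a single expression. A sketch using Spark SQL's built-in locate and reverse (again assuming each value contains at least one '/'; onlyPath is an illustrative name):

// locate('/', reverse(column1)) is the position of the last '/' counted from the end,
// i.e. the basename length + 1, so everything before it is the dirname.
val onlyPath = df.withColumn("path",
  expr("substring(column1, 1, length(column1) - locate('/', reverse(column1)))"))
onlyPath.show(false)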
Answer 2 (score: 1)
Another approach is to use a UDF:
import org.apache.spark.sql.functions.udf

val pathUDF = udf((s: String) => s.substring(0, s.lastIndexOf("/")))

val test = df.withColumn("basename", substring_index($"column1", "/", -1))
  .withColumn("path", pathUDF($"column1"))
test.show(false)
+------------------------------+--------+-------------------------+
|column1 |basename|path |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file |/test/coucou/jambon/hello|
|/test/jambon/prout/test |test |/test/jambon/prout |
+------------------------------+--------+-------------------------+
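
Note that this UDF assumes non-null values that contain at least one '/': it throws on null input, and when there is no '/' lastIndexOf returns -1 so substring throws as well. A more defensive variant might look like the following sketch (safePathUDF is just an illustrative name):

import org.apache.spark.sql.functions.udf

// Returns null for null input and the unchanged string when no '/' is present.
val safePathUDF = udf((s: String) =>
  Option(s).map { str =>
    val idx = str.lastIndexOf("/")
    if (idx >= 0) str.substring(0, idx) else str
  }.orNull)

df.withColumn("path", safePathUDF($"column1")).show(false)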