Spark: UDF to get the dirname from a path

Date: 2019-04-03 09:34:19

Tags: scala, apache-spark

I have a large number of path columns that I need to split into two columns, basename and dirname. I know how to easily get the basename of a path using:
val df = Seq("/test/coucou/jambon/hello/file"
    ,"/test/jambon/test")
    .toDF("column1")
df.withColumn("basename", substring_index($"column1"  , "/", -1))
.show(2, false)
+------------------------------+---------+
|column1                       |basename |
+------------------------------+---------+
|/test/coucou/jambon/hello/file|file     |
|/test/jambon/test             |test     |
+------------------------------+---------+

but I am struggling to get the dirname, like this:

+------------------------------+--------------------------+
|column1                       |dirname                   |
+------------------------------+--------------------------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello |
|/test/jambon/test             |/test/jambon              |
+------------------------------+--------------------------+

I have tried various solutions but could not find a functional, column-based one.
My best idea would be to subtract $"basename" from $"column1", but I could not find a way to subtract Strings in Spark.

3 Answers:

Answer 0 (score: 3):

An alternative to the solutions already provided, using regular expressions:

Get the regex right and the regexp_extract function will give you what you need.

   val df = Seq("/test/coucou/jambon/hello/file"
      , "/test/jambon/prout/test")
      .toDF("column1")

    import org.apache.spark.sql.functions.regexp_extract

    df.withColumn("path", regexp_extract('column1, "^\\/(\\w+\\/)+", 0)).withColumn("fileName",regexp_extract('column1, "\\w+$", 0)).show(false)

Output:

+------------------------------+--------------------------+--------+
|column1                       |path                      |fileName|
+------------------------------+--------------------------+--------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello/|file    |
|/test/jambon/prout/test       |/test/jambon/prout/       |test    |
+------------------------------+--------------------------+--------+

Edit:
Without the trailing slash, this is easier to manage:

df.withColumn("path",regexp_extract($"column1", "^(.+)(/.+)$", 1 ) ) )

Answer 1 (score: 2):

You can use expr to take a substring of column1. The code should look like the following; I hope it helps.

// Creating test data
import org.apache.spark.sql.functions.{expr, substring_index}

val df = Seq("/test/coucou/jambon/hello/file",
  "/test/jambon/prout/test")
  .toDF("column1")

val test = df.withColumn("basename", substring_index($"column1", "/", -1))
    .withColumn("path", expr("substring(column1, 1, length(column1) - length(basename) - 1)"))

test.show(false)
+------------------------------+--------+-------------------------+
|column1                       |basename|path                     |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file    |/test/coucou/jambon/hello|
|/test/jambon/prout/test       |test    |/test/jambon/prout       |
+------------------------------+--------+-------------------------+
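
A variant of the same idea that stays in the Column API instead of embedding a SQL string: Column.substr accepts Column arguments, so the length arithmetic can be expressed directly (a sketch; test2 is a hypothetical name, and the result should match the table above):

import org.apache.spark.sql.functions.{length, lit, substring_index}

val test2 = df.withColumn("basename", substring_index($"column1", "/", -1))
  .withColumn("path", $"column1".substr(lit(1), length($"column1") - length($"basename") - 1))

test2.show(false)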

Answer 2 (score: 1):

Another approach is to use a UDF:

import org.apache.spark.sql.functions.{substring_index, udf}

val pathUDF = udf((s: String) => s.substring(0, s.lastIndexOf("/")))

val test = df.withColumn("basename", substring_index($"column1", "/", -1))
    .withColumn("path", pathUDF($"column1"))

test.show(false)
+------------------------------+--------+-------------------------+
|column1                       |basename|path                     |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file    |/test/coucou/jambon/hello|
|/test/jambon/prout/test       |test    |/test/jambon/prout       |
+------------------------------+--------+-------------------------+
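
One caveat with this UDF: s.lastIndexOf("/") returns -1 when the value contains no slash, so s.substring(0, -1) throws a StringIndexOutOfBoundsException, and a null value causes a NullPointerException. A slightly defensive sketch (the empty-string fallback for slash-less paths is an assumption, not part of the original answer):

import org.apache.spark.sql.functions.udf

val safePathUDF = udf((s: String) =>
  Option(s).map { str =>
    val idx = str.lastIndexOf("/")
    if (idx >= 0) str.substring(0, idx) else "" // assumption: no slash -> empty dirname
  }.orNull)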