Question

I have a dataset that looks like this:

! Hello World.  1
" Hi there. 0

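What I want to do is remove all special characters from the beginning of each line (only the leading ones, not special characters elsewhere in the line).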

To read the data (which is tab-delimited), I use the following code:

val data = sparkSession.read.format("com.databricks.spark.csv")
    .option("delimiter", "\t")
    .load("data.txt")

val columns = Seq("text", "class")
val df = data.toDF(columns: _*)

I know I should use replaceAll(), but I'm not sure how to go about it.

Answer 1

You can create a udf and apply it to the first column of the dataframe to remove the leading special characters:

val df = Seq(("! Hello World.", 1), ("\" Hi there.", 0)).toDF("text", "class")

df.show
+--------------+-----+
|          text|class|
+--------------+-----+
|! Hello World.|    1|
|   " Hi there.|    0|
+--------------+-----+    


import org.apache.spark.sql.functions.udf

// remove leading non-word characters from a string
def remove_leading: String => String = _.replaceAll("^\\W+", "")
val udf_remove = udf(remove_leading)

df.withColumn("text", udf_remove($"text")).show
+------------+-----+
|        text|class|
+------------+-----+
|Hello World.|    1|
|   Hi there.|    0|
+------------+-----+
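
To see what the `^\\W+` pattern used in this answer does and does not strip, here is a minimal plain-Scala sketch with no Spark dependency. The object name `LeadingStripDemo`, the helper `removeLeading`, and the extra sample strings are illustrative assumptions, not part of the original answer.

```scala
object LeadingStripDemo {
  // Same regex as the UDF above: one or more non-word characters,
  // anchored at the start of the string.
  def removeLeading(s: String): String = s.replaceAll("^\\W+", "")

  def main(args: Array[String]): Unit = {
    // The first two samples come from the question; the rest are made up.
    val samples = Seq("! Hello World.", "\" Hi there.", "No change!", "?? a, b")
    samples.foreach(s => println(s"[$s] -> [${removeLeading(s)}]"))
  }
}
```

Two caveats worth keeping in mind: by default `\W` in Java regexes means anything outside `[a-zA-Z0-9_]`, so leading non-ASCII letters would be stripped as well, while a leading underscore would be kept. Spark also ships a built-in `regexp_replace` function in `org.apache.spark.sql.functions` that takes a column, a pattern, and a replacement; passing the same pattern there would likely make the UDF unnecessary, though the original answer does not show that variant.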

Answer 2

This might help:

val str = " some string "
str.trim

Or, to trim specific characters:

str.stripPrefix(",").stripSuffix(",").trim

Or, to remove certain characters from the front:

val ignorable = ", \t\r\n"
str.dropWhile(c => ignorable.indexOf(c) >= 0)
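
The string operations above can be compared side by side in a small sketch. The object name `TrimDemo`, the helpers `dropLeading` and `dropLeadingSpecial`, and the sample input are hypothetical and chosen only for illustration; `dropLeadingSpecial` is a variation that is not in the original answer.

```scala
object TrimDemo {
  // Characters to discard from the front of a string (same set as above).
  val ignorable = ", \t\r\n"

  // Drop leading characters for as long as they belong to the ignorable set.
  def dropLeading(s: String): String =
    s.dropWhile(c => ignorable.indexOf(c) >= 0)

  // Variation: drop everything before the first letter or digit.
  def dropLeadingSpecial(s: String): String =
    s.dropWhile(c => !c.isLetterOrDigit)

  def main(args: Array[String]): Unit = {
    val str = ", some string, "
    println(s"[${str.trim}]")                                   // whitespace only
    println(s"[${str.stripPrefix(",").stripSuffix(",").trim}]") // exact prefix/suffix
    println(s"[${dropLeading(str)}]")                           // leading set only
    println(s"[${dropLeadingSpecial("! Hello World.")}]")
  }
}
```

Note that `stripPrefix` and `stripSuffix` remove only an exact match at the very edge of the string, so here the trailing comma survives because a space follows it. `dropWhile` works on leading characters only and leaves the rest of the string untouched, which is closer to what the question asks for.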

All the useful operations on strings can be found at

Remove special characters from dataframe rows
