Handling a multi-character delimiter in Spark

Time: 2018-08-29 18:15:20

Tags: scala apache-spark databricks

Some of the CSV files I am reading use [~] as the delimiter.

1[~]a[~]b[~]dd[~][~]ww[~][~]4[~]4[~][~][~][~][~]

I have tried

val rddFile = sc.textFile("file.csv")
val rddTransformed = rddFile.map(eachLine=>eachLine.split("[~]"))
val df = rddTransformed.toDF()
display(df)

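The problem with this is that each row comes out as a single array column, and every field still contains the [ and ] characters. The array looks like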

["1[","]a[","]b[",...]

I cannot use

val df = spark.read.option("sep", "[~]").csv("file.csv")

because multi-character delimiters are not supported. What other approach can I take?

1[~]a[~]b[~]dd[~][~]ww[~][~]4[~]4[~][~][~][~][~]
2[~]a[~]b[~]dd[~][~]ww[~][~]4[~]4[~][~][~][~][~]
3[~]a[~]b[~]dd[~][~]ww[~][~]4[~]4[~][~][~][~][~]

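Edit: this is not a duplicate. The linked thread deals with multiple delimiters, whereas this is a single delimiter made of several characters.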

1 Answer:

Answer 0 (score: 3)

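Try the following:

import spark.implicits._ // needed for toDF() outside spark-shell or a Databricks notebook
val df = spark.read.format("csv").load("inputpath")
// Escape [, ~ and ] so that split treats "[~]" as a literal delimiter rather than a regex character class
df.rdd.map(i => i.mkString.split("\\[\\~\\]")).toDF().show(false)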
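The escaping in the split pattern above matters because `String.split` takes a regular expression. Left unescaped, `"[~]"` is a character class that matches a single `~`, which would explain the stray brackets in the question's output. The following is a minimal plain-Scala sketch (no Spark required) that contrasts the two; the object name `DelimiterSplitSketch` is made up for illustration, and `Pattern.quote` is shown as one alternative to escaping by hand.

```scala
import java.util.regex.Pattern

object DelimiterSplitSketch {
  // Sample line taken from the question
  val line = "1[~]a[~]b[~]dd[~][~]ww[~][~]4[~]4[~][~][~][~][~]"

  // Unescaped: "[~]" is a regex character class matching only '~',
  // so the brackets stay attached to the neighbouring fields.
  val naive: Array[String] = line.split("[~]")

  // Pattern.quote builds a pattern that matches the delimiter literally.
  val literal: Array[String] = line.split(Pattern.quote("[~]"))

  // A negative limit keeps trailing empty fields, which split drops by default.
  val withTrailing: Array[String] = line.split(Pattern.quote("[~]"), -1)
}
```

With the default limit, trailing empty fields are discarded, so rows that end in different numbers of empty fields yield arrays of different lengths. Passing `-1` as the limit is one way to keep the field count stable across rows.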

For your other requirement (one column per field):

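import org.apache.spark.sql.functions.{col, split}
// Re-join the fields with commas (this assumes no field itself contains a comma)
val df1 = df.rdd.map(i => i.mkString.split("\\[\\~\\]").mkString(",")).toDF()
// Use the first row to decide how many columns to create
val iterationColumnLength = df1.rdd.first.mkString(",").split(",").length
df1.withColumn("value",split(col("value"),",")).select((0 until iterationColumnLength).map(i => col("value").getItem(i).as("col_" + i)): _*).show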
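As a possible alternative to going through the CSV reader and an RDD, the lines could be read as plain text and split with the DataFrame `split` function, which also takes a regular expression. This is only a sketch under assumptions: the object name `MultiCharDelimiterSketch`, the helper `splitColumns`, and the fixed `numFields` argument are all hypothetical, and the input is assumed to be a DataFrame with a single string column named `value` (the shape `spark.read.text` produces). Newer Spark releases (3.0 and later, as far as I know) are reported to accept a multi-character `sep` in the CSV reader, so it may be worth checking the version in use first.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, split}

object MultiCharDelimiterSketch {
  // Regex for the literal delimiter "[~]"; the brackets are escaped so they are
  // not read as a character class.
  val delimiterRegex = "\\[~\\]"

  // Hypothetical helper: `lines` is assumed to have one string column named "value".
  // `numFields` is supplied by the caller rather than inferred from the data.
  def splitColumns(lines: DataFrame, numFields: Int): DataFrame = {
    // Split each line into an array column, then expose each element as its own column.
    val parts = lines.select(split(col("value"), delimiterRegex).as("parts"))
    parts.select((0 until numFields).map(i => col("parts").getItem(i).as(s"col_$i")): _*)
  }
}
```

A call might look like `MultiCharDelimiterSketch.splitColumns(spark.read.text("file.csv"), 15).show(false)`, where 15 is the number of fields in the question's sample rows. Taking the field count as a parameter avoids an extra pass over the first row, at the cost of having to know the layout up front.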
