Spark: ignoring commas inside strings

Time: 2018-02-07 13:06:06

Tags: apache-spark

I am trying to load a CSV through a Spark session, but I run into a problem with strings that contain both double quotes and commas, e.g.:

"""A"" STAR ACCOUNTING,& TRAINING SOLUTIONS LIMITED"

Spark splits the string above into 2 separate columns in the resulting DataFrame. Output:

"""A"" STAR ACCOUNTING 
& TRAINING SOLUTIONS LIMITED"

This is how I read the CSV through the Spark session:

val df = ss.read
          .option("header", true)
          .option("ignoreLeadingWhiteSpace", "true")
          .csv(csvFile)
          .sort(id)

Is there any way to read the CSV file so that commas inside strings are not treated as delimiters?

1 answer:

Answer 0 (score: 1)

It looks like your data uses " as the escape character, whereas the default is \. You should supply the escape option when reading:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.option("escape", "\"").csv(Seq("\"\"\"A\"\" STAR ACCOUNTING,& TRAINING SOLUTIONS LIMITED").toDS).show(false)
+------------------------------------------------+
|_c0                                             |
+------------------------------------------------+
|"A" STAR ACCOUNTING,& TRAINING SOLUTIONS LIMITED|
+------------------------------------------------+
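Applied to the file-based read from the question, the same option might look like the sketch below. It assumes the file has a header row with columns named id and name (hypothetical, since the question does not show the header), and it writes a small temporary file only so that the example is self-contained. The object name EscapedQuotesCsv and the helper readCsv are illustrative, not part of Spark.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object EscapedQuotesCsv {
  // Hypothetical helper: reads a CSV whose quoted fields escape an
  // embedded double quote by doubling it ("") instead of using a backslash.
  def readCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("ignoreLeadingWhiteSpace", "true")
      .option("escape", "\"") // treat " as the escape character inside quoted fields
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("escaped-quotes-csv")
      .getOrCreate()

    // Sample file; the id and name header columns are assumptions for this sketch.
    val sample =
      "id,name\n1,\"\"\"A\"\" STAR ACCOUNTING,& TRAINING SOLUTIONS LIMITED\"\n"
    val file = Files.createTempFile("companies", ".csv")
    Files.write(file, sample.getBytes(StandardCharsets.UTF_8))

    // The company name should stay in one column, comma included.
    val df = readCsv(spark, file.toString).sort("id")
    df.show(false)

    spark.stop()
  }
}
```

The quote character itself stays at its default ("), so only the escape option needs to change; the comma is then kept because it sits inside a properly quoted field.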