Question

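I am having trouble reading a CSV file with Spark when the multiLine option is set to true. Are there any conditions that determine whether multiLine should be set to true or false?

I am using Windows 10, Scala 2.11.11, and Spark 2.2.0.

The dataset I am testing with: https://drive.google.com/file/d/15k7ffbyQZ8h_93t4G5Y1U2rPHSAyA9GX/view?usp=sharing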

val df = sparkSession.read.format("csv")
  .option("header", "true")
  .option("inferSchema", true)
  .option("delimiter", ",")    // "delimiter" is an alias of "sep", set again below
  .option("multiLine", true)   // let quoted fields span several lines
  .option("wholeFile", true)
  .option("sep", ",")
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .option("encoding", "utf-8")
  .option("quote", "\"")
  .option("escape", "\"")      // a quote inside a quoted field is escaped by another quote
  .load("C:/Notebook/work/input/Country.csv")
  .repartition(2)

With multiLine set to true, I get a count of 77. With multiLine set to false, I get the correct count of 247.

Can anyone tell me what I am doing wrong?

Thanks!

Incorrect count when reading a CSV file with Spark and the multiLine option set to true

0 answers: