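I am trying to read CSV data into a DataFrame in Spark 2.2.0. Some cells contain multi-line text, and the first line of such a cell has a few words in double quotes. Below is the code I used. I have tried many options, but none of them really worked.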
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("multiLine", "true")
      .option("quoteMode", "ALL")
      .option("mode", "PERMISSIVE")
      .option("ignoreLeadingWhiteSpace", "true")
      .option("ignoreTrailingWhiteSpace", "true")
      .option("parserLib", "UNIVOCITY")
      .load("C:/Desktop/testing.csv"))
This is the data we are trying to read from the file. The first cell contains three lines of text.
Input data:
+----------------------------------------+------------------------+
| text| time|
+----------------------------------------+------------------------+
|#Word #This "are acting though." | 08-11-2016 05:47:00 |
|This is the | |
|Not so. | |
+----------------------------------------+------------------------+
|I'm not sure if I have any left | 08-11-2016 05:48:00 |
+----------------------------------------+------------------------+
|bob day is an honest person | 08-11-2016 05:49:00 |
|"a loss to the senate" | |
+----------------------------------------+------------------------+
The job runs without errors, but the data is read incorrectly. It is read as follows.
Output:
+----------------------------------------+------------------------+
| text| time|
+----------------------------------------+------------------------+
|\#Word #This \"\"are acting though.\"\""| |
+----------------------------------------+------------------------+
|This is the | |
|Not so.\",08-11-2016 05:47:00 | |
+----------------------------------------+------------------------+
|I'm not sure if I have any left | 08-11-2016 05:48:00 |
+----------------------------------------+------------------------+
|\bob day is an honest person | 08-11-2016 05:49:00 |
|\"\"a loss to the senate\"\"\"" | |
+----------------------------------------+------------------------+
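The row is split into two rows after the double-quoted text, and a few stray "\" characters appear. The timestamp has also moved into the text column.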
Answer 0 (score: 0)
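According to this link, you should set the wholeFile option to True to escape newlines that fall between the characters specified by escape. However, it looks like the text containing newlines in your file is not escaped, so this may not work. You should reformat your source data so that any text containing newlines is quoted.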
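As a rough illustration of what "quoted" means here, the following sketch uses only Python's standard csv module (not Spark) to write the sample rows with every field quoted and then read them back. The row contents are copied from the question. The in-memory buffer and variable names are assumptions made for the example, and the real file would of course be produced by whatever tool exports your data.

```python
import csv
import io

# Sample rows taken from the question; the first and last text cells
# contain embedded newlines and double-quoted words.
rows = [
    ("text", "time"),
    ('#Word #This "are acting though."\nThis is the\nNot so.', "08-11-2016 05:47:00"),
    ("I'm not sure if I have any left", "08-11-2016 05:48:00"),
    ('bob day is an honest person\n"a loss to the senate"', "08-11-2016 05:49:00"),
]

# Write every field inside double quotes. Embedded double quotes are
# doubled ("") by the csv module's default dialect.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read it back: quoted fields keep their newlines, so each logical
# record stays a single row with two columns.
parsed = [tuple(record) for record in csv.reader(io.StringIO(csv_text, newline=""))]

print(csv_text)
print(len(parsed), "records parsed")
```

A file written this way escapes embedded quotes by doubling them, whereas Spark's CSV reader uses a backslash as its default escape character. When reading such a file with multiLine enabled, it is therefore probably necessary to also set the escape option to a double quote.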