Question

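I first read a delimited file containing multiple rows and index the rows with zipWithIndex. Next, I try to write the DataFrame created from the resulting RDD[Row] to a CSV file using Scala.

Here is my code: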

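import org.apache.spark.sql.Row

import org.apache.spark.sql.types.{LongType, StructField, StructType}

val FileDF = spark.read.csv(inputfilepath)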

val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2.toLong+SEED+1) +: indexedRow._1.toSeq))

val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier",LongType)).++(FileDF.schema.fields))

val dataframenew = spark.createDataFrame(rdd,FileDFWithSeqNo)

dataframenew.write.format("com.databricks.spark.csv").option("delimiter","|").save("C:\\Users\\path\\Desktop\\IndexedOutput")

where dataframenew is the final DataFrame.

The input data looks like this:

0|0001|10|1|6001825851|0|0|0000|0|003800543||2017-03-02 00:00:00|95|O|473|3.74|0.05|N|||5676|6001661630||473|1|||UPS|2017-03-02 00:00:00|0.0000||0||20170303|793358|793358115230979
0|0001|10|1|6001825853|0|0|0000|0|003811455||2017-03-02 00:00:00|95|O|90|15.14|0.55|N|||1080|6001661630||90|1|||UPS|2017-03-02 00:00:00|0.0000||0||20170303|793358|793358115230980
0|0001|10|1|6001825854|0|0|0000|0|003812898||2017-03-02 00:00:00|95|O|15|7.60|1.33|N|||720|6001661630||15|1|||UPS|2017-03-02 00:00:00|0.0000||0||20170303|793358|793358115230981

I use zipWithIndex to get a unique identifier for each row.

However, this gives me an output file with data like the following:
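
To make the numbering step concrete, here is a minimal sketch of the same idea on a plain Scala collection instead of an RDD. The object name, the helper `addRowIds`, the sample lines, and the seed value of 1000 are all assumptions for illustration (the seed is only guessed from the 1001 in my output); on an RDD, zipWithIndex already yields a Long index.

```scala
object ZipWithIndexSketch {
  // Assumed seed value; the real SEED is defined elsewhere in my job.
  val Seed: Long = 1000L

  // Hypothetical helper: pair every line with (zero-based index + seed + 1).
  def addRowIds(lines: Seq[String], seed: Long): Seq[(Long, String)] =
    lines.zipWithIndex.map { case (line, idx) => (idx.toLong + seed + 1, line) }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines, not my real input.
    val lines = Seq("a|b", "c|d", "e|f")
    // Prints 1001,a|b then 1002,c|d then 1003,e|f
    addRowIds(lines, Seed).foreach { case (id, line) => println(s"$id,$line") }
  }
}
```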

1001,"0|0001|10|1|6001825851|0|0|0000|PS|0|0.0000||0||20170303|793358|793358115230979",cabc

The expected output is:

1001,0|0001|10|1|6001825851|0|0|0000|PS|0|0.0000||0||20170303|793358|793358115230979,cabc

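Why are extra quotation marks being added to the data, and how can I get rid of them?

Unwanted quotes appear in the data when saving a DataFrame to a file

0 answers: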
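
One possible explanation, which I have not confirmed: CSV writers conventionally wrap a field in quotes when the field itself contains the delimiter character. If my file is read with the default comma delimiter, each pipe-separated line may end up as a single column value that still contains "|", and writing with "|" as the delimiter would then trigger quoting. The sketch below only illustrates that general rule in plain Scala. `quoteIfNeeded` and `writeLine` are hypothetical helpers, not Spark's implementation, and doubling embedded quotes follows the RFC 4180 convention rather than Spark's default escape character.

```scala
object CsvQuotingSketch {
  // Hypothetical helper: quote a field only if it contains the delimiter,
  // the quote character, or a line break.
  def quoteIfNeeded(field: String, delimiter: Char, quote: Char = '"'): String = {
    val needsQuote =
      field.exists(c => c == delimiter || c == quote || c == '\n' || c == '\r')
    if (needsQuote) {
      // Double any embedded quote characters (RFC 4180 style).
      val escaped = field.replace(quote.toString, quote.toString * 2)
      s"$quote$escaped$quote"
    } else field
  }

  // Hypothetical helper: join the (possibly quoted) fields with the delimiter.
  def writeLine(fields: Seq[String], delimiter: Char): String =
    fields.map(f => quoteIfNeeded(f, delimiter)).mkString(delimiter.toString)

  def main(args: Array[String]): Unit = {
    val fields = Seq("1001", "0|0001|10", "cabc")
    println(writeLine(fields, '|')) // the middle field contains '|', so it is quoted
    println(writeLine(fields, ',')) // no field contains ',', so nothing is quoted
  }
}
```

If this is indeed the cause, reading the input with the "|" delimiter so that each value becomes its own column should leave no field containing the delimiter at write time.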