Removing double quotes around blanks and nulls in copyMerge output TSV file on Scala Spark

Asked: 2018-02-13 20:51:21

Tags: scala csv hdfs spark-dataframe

When I merge the part files of a DataFrame created via HQL using the FileUtil.copyMerge function, the final text file on HDFS contains "" around blanks, as well as null values. This is what I am doing:

import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.apache.hadoop.conf.Configuration

val dir = "your/output/folder"

val selectedData = hiveContext.sql(executeHQL)

// Keep only the schema (zero rows) so that header = "true" writes just the column names
val selectedDataHeader = selectedData.limit(0)

// write header to a file for merging   
selectedDataHeader.write
  .format("csv")
  .option("header", "true").option("delimiter", "\t")
  .save(dir + "/header")

// create the part files  
selectedData.write
  .format("csv")
  .option("header", "false").option("delimiter", "\t")
  .save(dir + "/parts")     

val configuration = new Configuration()
val fs = FileSystem.get(configuration)

// Merge the header file into the parts directory; the name "header" sorts
// before the "part-*" files, so it ends up first in the final merge
FileUtil.copyMerge(
  fs, new Path(dir + "/header"),
  fs, new Path(dir + "/parts/header"),
  false, configuration, "")

// Merge everything under /parts (header first) into one destination file;
// the last argument is a string appended after each merged file (null = none)
FileUtil.copyMerge(
  fs, new Path(dir + "/parts/"),
  fs, new Path(dir + "/destination.txt"),
  false, configuration, null)
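
For context, copyMerge only concatenates the source files byte for byte, so the quotes are most plausibly produced by the CSV writer itself: with default options, Spark's CSV source writes empty strings quoted as "" and nulls as empty fields. A minimal sketch (hypothetical data and output path, reusing the same hiveContext) that reproduces this:

import hiveContext.implicits._

// With default CSV options, the empty string in c2 comes out quoted ("")
// and the null in c3 comes out as an empty field
val repro = Seq(("ROW2", "", null.asInstanceOf[String])).toDF("c1", "c2", "c3")
repro.write
  .format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .save("/tmp/quote-repro")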

The problem shows up in /destination.txt, which looks like this:

""    ABC  XYZ PQR  ""  XYZ2  
ROW2  ABC2 XYZ PQR2 ""  XYZ3

How can I remove the "" around the blanks, or the null values? Note that the part files themselves do not show the "", though.
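
One possible direction (not from the original post; it assumes Spark 2.4 or later, where the CSV writer supports the emptyValue option) would be to replace nulls before writing and tell the writer to emit empty strings unquoted:

selectedData
  .na.fill("")                      // replace nulls in string columns with ""
  .write
  .format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .option("emptyValue", "")         // Spark 2.4+: write empty strings without quotes
  .save(dir + "/parts")

On versions before 2.4 the emptyValue write option does not exist, so the quoted empty fields would have to be cleaned up some other way.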

0 Answers:

No answers yet.