Removing double quotes around blanks and nulls in copyMerge output TSV file on Scala Spark

Asked: 2018-02-13 20:51:21

Tags: scala csv hdfs spark-dataframe

When I merge the part files of a DataFrame created via HQL using the FileUtil.copyMerge function, the final text file on HDFS contains "" around blanks, as well as null values. This is what I am doing:

import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.apache.hadoop.conf.Configuration

val dir = "your/output/folder"

val selectedData = hiveContext.sql(executeHQL)

// Keep only the schema (zero rows) so that header = "true" writes just the column names
val selectedDataHeader = selectedData.limit(0)

// write header to a file for merging   
selectedDataHeader.write
  .format("csv")
  .option("header", "true").option("delimiter", "\t")
  .save(dir + "/header")

// create the part files  
selectedData.write
  .format("csv")
  .option("header", "false").option("delimiter", "\t")
  .save(dir + "/parts")     

val configuration = new Configuration()
val fs = FileSystem.get(configuration)

// Merge the header file into the parts directory; the name "header" sorts
// before the "part-*" files, so it ends up first in the final merge
FileUtil.copyMerge(
  fs, new Path(dir + "/header"),
  fs, new Path(dir + "/parts/header"),
  false, configuration, "")

// Merge everything under /parts (header first) into one destination file;
// the last argument is a string appended after each merged file (null = none)
FileUtil.copyMerge(
  fs, new Path(dir + "/parts/"),
  fs, new Path(dir + "/destination.txt"),
  false, configuration, null)
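
For context, copyMerge only concatenates the source files byte for byte, so the quotes are most plausibly produced by the CSV writer itself: with default options, Spark's CSV source writes empty strings quoted as "" and nulls as empty fields. A minimal sketch (hypothetical data and output path, reusing the same hiveContext) that reproduces this:

import hiveContext.implicits._

// With default CSV options, the empty string in c2 comes out quoted ("")
// and the null in c3 comes out as an empty field
val repro = Seq(("ROW2", "", null.asInstanceOf[String])).toDF("c1", "c2", "c3")
repro.write
  .format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .save("/tmp/quote-repro")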

The problem shows up in /destination.txt, which looks like this:

""    ABC  XYZ PQR  ""  XYZ2  
ROW2  ABC2 XYZ PQR2 ""  XYZ3

How can I remove the "" around the blanks, or the null values? Note that the part files themselves do not show the "", though.
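
One possible direction (not from the original post; it assumes Spark 2.4 or later, where the CSV writer supports the emptyValue option) would be to replace nulls before writing and tell the writer to emit empty strings unquoted:

selectedData
  .na.fill("")                      // replace nulls in string columns with ""
  .write
  .format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .option("emptyValue", "")         // Spark 2.4+: write empty strings without quotes
  .save(dir + "/parts")

On versions before 2.4 the emptyValue write option does not exist, so the quoted empty fields would have to be cleaned up some other way.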

0 Answers:

No answers yet.