When I merge the part files of a DataFrame created via HQL using FileUtil.copyMerge, I get "" in place of blanks and null values in the final text file on HDFS. This is what I am doing:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._

val dir = "your/output/folder"
val selectedData = hiveContext.sql(executeHQL)

// Pick the header only (empty DataFrame; the header row comes from the schema)
val selectedDataHeader = selectedData.limit(0)

// Write the header to its own file for merging
selectedDataHeader.write
  .format("csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .save(dir + "/header")

// Create the part files (no header)
selectedData.write
  .format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .save(dir + "/parts")

val configuration = new Configuration()
val fs = FileSystem.get(configuration)

// Copy the header into the parts folder so it merges first
FileUtil.copyMerge(
  fs, new Path(dir + "/header"),
  fs, new Path(dir + "/parts/header"),
  false, configuration, "")

// Merge the header and all part files into one output file
FileUtil.copyMerge(
  fs, new Path(dir + "/parts/"),
  fs, new Path(dir + "/destination.txt"),
  false, configuration, null)
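For context, my understanding is that copyMerge concatenates the files in the source directory in name order, which is why I drop the header into /parts under a name ("header") that sorts before the part-* files. A minimal local sketch of that merge order, using plain java.nio on the local filesystem instead of HDFS (file names and contents here are made up for illustration):

```scala
import java.nio.file.Files
import scala.jdk.CollectionConverters._

// Temp dir standing in for the HDFS parts folder (local sketch, not HDFS)
val partsDir = Files.createTempDirectory("parts")

// "header" sorts lexicographically before "part-00000",
// so the merged output starts with the header line
Files.write(partsDir.resolve("part-00000"), "ROW2\tABC2\tXYZ\n".getBytes("UTF-8"))
Files.write(partsDir.resolve("header"), "col1\tcol2\tcol3\n".getBytes("UTF-8"))

// Concatenate the files in file-name order, mimicking a merge over a sorted listing
val merged = Files.list(partsDir).iterator().asScala.toSeq
  .sortBy(_.getFileName.toString)
  .map(p => new String(Files.readAllBytes(p), "UTF-8"))
  .mkString

print(merged)
```

The same ordering is what puts the header first in /destination.txt above.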
The problem shows up in /destination.txt, which looks like this:
"" ABC XYZ PQR "" XYZ2
ROW2 ABC2 XYZ PQR2 "" XYZ3
How do I get rid of the "" around the blanks / null values? Note that the part files themselves do not show the "", though.