When I apply CONCAT in Spark SQL on a DataFrame and then store that DataFrame as a CSV file in an HDFS location, extra double quotes are added around the concatenated column, but only in the output file.

The double quotes are not added when I apply show(). They appear only when I store the DataFrame as a CSV file.

It seems I need to remove the extra double quotes that get added when the DataFrame is saved as a CSV file.

I am using the com.databricks:spark-csv_2.10:1.1.0 jar.
The Spark version is 1.5.0-cdh5.5.1.
Input:
campaign_file_name_1, campaign_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89, 1
campaign_file_name_1, campaign_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk, 2
Expected output:
campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89, campaign_name_1"="1, 2017-06-06 17:09:31
campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk, campaign_name_1"="2, 2017-06-06 17:09:31
Spark code:
object campaignResultsMergerETL extends BaseETL {

  val now  = ApplicationUtil.getCurrentTimeStamp()
  val conf = new Configuration()
  val fs   = FileSystem.get(conf)
  val log  = LoggerFactory.getLogger(this.getClass.getName)

  def main(args: Array[String]): Unit = {
    //---------------------
    // code for sqlContext initialization
    //---------------------
    val campaignResultsDF = sqlContext.read.format("com.databricks.spark.avro").load(campaignResultsLoc)
    campaignResultsDF.registerTempTable("campaign_results")

    val campaignGroupedDF = sqlContext.sql(
      """
        |SELECT campaign_file_name,
        |campaign_name,
        |tracker_id,
        |SUM(campaign_measure) AS campaign_measure
        |FROM campaign_results
        |GROUP BY campaign_file_name, campaign_name, tracker_id
      """.stripMargin)
    campaignGroupedDF.registerTempTable("campaign_results_full")

    val campaignMergedDF = sqlContext.sql(
      s"""
        |SELECT campaign_file_name,
        |tracker_id,
        |CONCAT(campaign_name,'\"=\"' ,campaign_measure),
        |"$now" AS audit_timestamp
        |FROM campaign_results_full
      """.stripMargin)

    campaignMergedDF.show(20)
    saveAsCSVFiles(campaignMergedDF, campaignResultsExportLoc, numPartitions)
  }

  def saveAsCSVFiles(campaignMeasureDF: DataFrame, hdfs_output_loc: String, numPartitions: Int): Unit = {
    log.info("saveAsCSVFile method started")
    if (fs.exists(new Path(hdfs_output_loc))) {
      fs.delete(new Path(hdfs_output_loc), true)
    }
    campaignMeasureDF.repartition(numPartitions).write.format("com.databricks.spark.csv").save(hdfs_output_loc)
    log.info("saveAsCSVFile method ended")
  }
}
The result of campaignMergedDF.show(20) is correct and looks fine:
campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89, campaign_name_1"="1, 2017-06-06 17:09:31
campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk, campaign_name_1"="2, 2017-06-06 17:09:31
The result of saveAsCSVFiles is incorrect:
campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89, "campaign_name_1""=""1", 2017-06-06 17:09:31
campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk, "campaign_name_1""=""2", 2017-06-06 17:09:31
Can someone help me with this issue?
Answer (score: 1)
When you use

write.format("com.databricks.spark.csv").save(hdfs_output_loc)

to write text that contains " into a CSV file, you run into this problem because spark-csv uses the " symbol as its default quote character.

Replacing the default quote character " with something else (e.g. the NUL character) should let you write " to the file as is:

write.format("com.databricks.spark.csv").option("quote", "\u0000").save(hdfs_output_loc)
Explanation

You are using the spark-csv defaults:

escape character: \
quote character: "
This answer suggests the following:

The way to turn off the default escaping of the double quote character (") with the backslash character (\) - i.e. to avoid escaping for all characters entirely - is to add an .option() method call with just the right parameters after your .write() method call. The goal of the option() method call is to change how the csv() method "finds" instances of the "quote" character as it is emitting the content. To do this, you must change the default of what a "quote" actually means; i.e. change the character sought from the double quote character (") to the Unicode "\u0000" character (essentially providing the Unicode NUL character, assuming it won't ever occur within the document).
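For illustration, here is a small self-contained sketch of the difference. The object name, sample values, and /tmp output paths are made up for this example; it assumes the same spark-csv package is on the classpath and a Spark 1.x-style SQLContext:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object QuoteOptionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("quote-option-demo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // One row whose last column already contains literal double quotes,
    // similar to the CONCAT output in the question.
    val df = Seq(("campaign_file_name_1", "tracker_1", """campaign_name_1"="1"""))
      .toDF("campaign_file_name", "tracker_id", "merged")

    // Default options: the field gets wrapped in quotes and the embedded "
    // is doubled, which is the unwanted output observed in the question.
    df.write.format("com.databricks.spark.csv").save("/tmp/with_default_quote")

    // NUL as the quote character: the field is written exactly as it
    // appears in the DataFrame.
    df.write.format("com.databricks.spark.csv").option("quote", "\u0000").save("/tmp/with_nul_quote")
  }
}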
关闭双引号字符的默认转义的方法 (“)用反斜杠字符() - 即避免全部逃逸 完全是字符,你必须使用just添加.option()方法调用 .write()方法调用后的正确参数。的目标 option()方法调用是改变csv()方法“找到”的方式 发出内容的“引用”字符的实例。至 要做到这一点,你必须改变“引用”实际意味着的默认值; 即改变寻求双引号字符的字符 (“)为Unicode”\ u0000“字符(基本上提供Unicode NUL字符假设它不会出现在文档中。)