I am trying to save an RDD encoded in ISO-8859-1 to an AWS S3 bucket using saveAsNewAPIHadoopFile, but the character encoding changes to UTF-8 when the file is saved to S3.
Code snippet
import java.nio.charset.Charset

val cell = " MYCOST £25" // the source data is UTF-8 encoded
val charset: Charset = Charset.forName("ISO-8859-1")
val cellData = cell.padTo(50, " ").mkString // right-pad the cell to 50 characters
val isoData = new String(cellData.getBytes(charset), charset) // intended to convert the string from UTF-8 to ISO-8859-1
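To make the conversion step concrete, this is the kind of standalone check I run locally (a minimal sketch, not part of the Spark job; the sample value is just an illustration):

val original = "MYCOST £25"
val isoBytes  = original.getBytes("ISO-8859-1") // '£' becomes the single byte 0xA3
val utf8Bytes = original.getBytes("UTF-8")      // '£' becomes the two bytes 0xC2 0xA3
println(isoBytes.length)  // 10
println(utf8Bytes.length) // 11
// The bytes I want to see in the S3 file are the ISO-8859-1 ones (one byte for '£').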
But when I save the file using saveAsNewAPIHadoopFile, the output changes to UTF-8. I think the TextOutputFormat used by saveAsNewAPIHadoopFile automatically converts the output to UTF-8. Is there a way to save the content to the S3 bucket with the original encoding (ISO-8859-1)?
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

ds.rdd.map { record =>
  record.map { cell => // each record is a collection of cell strings
    val cellData = cell.padTo(50, " ").mkString
    new String(cellData.getBytes("ISO-8859-1"), "ISO-8859-1")
  }.reduce { _ + _ } // concatenate the padded cells into one output line
}.mapPartitions { iter =>
  val text = new Text()
  iter.map { item =>
    text.set(item)
    (NullWritable.get(), text)
  }
}.saveAsNewAPIHadoopFile("s3://mybucket/", classOf[NullWritable], classOf[Text],
  classOf[TextOutputFormat[NullWritable, Text]])
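For what it is worth, my suspicion about TextOutputFormat seems consistent with how org.apache.hadoop.io.Text behaves: it stores its payload as UTF-8 bytes, so whatever charset the String was decoded from is lost as soon as text.set(item) is called. A minimal check, independent of the job above:

import org.apache.hadoop.io.Text

val t = new Text()
t.set("MYCOST £25")
// Text holds UTF-8 internally: '£' takes two bytes, so the length is 11, not 10
println(t.getLength)
println(t.getBytes.take(t.getLength).map(b => f"${b & 0xFF}%02X").mkString(" "))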
Appreciate your help
Answer 0 (score: 0)
I still have not found the right answer, but as a workaround I copy the file to HDFS, convert it to ISO format with iconv, and save it back to the S3 bucket. This works for me, but it needs two extra steps in the EMR cluster. I thought this might be useful for anyone who runs into the same problem.
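For anyone who wants to avoid the iconv step, the same transcoding can be done in plain Scala once the part file is local. A minimal sketch, assuming the part file fits in memory and that every character in it is representable in ISO-8859-1 (the file names are placeholders):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Read the UTF-8 part file produced by the Spark job...
val utf8Bytes = Files.readAllBytes(Paths.get("part-00000"))
val content   = new String(utf8Bytes, StandardCharsets.UTF_8)
// ...and write it back out as ISO-8859-1 before uploading to S3
Files.write(Paths.get("part-00000.iso"), content.getBytes(StandardCharsets.ISO_8859_1))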