saveAsNewAPIHadoopFile changing the character encoding to UTF-8

Date: 2017-08-04 13:05:55

Tags: hadoop apache-spark amazon-s3 utf-8 character-encoding

I am trying to save an RDD encoded in ISO-8859-1 to an AWS S3 bucket using saveAsNewAPIHadoopFile, but the character encoding changes to UTF-8 when the data is written to the S3 bucket.

Code snippet

import java.nio.charset.Charset

val cell = "MYCOST £25"                      // this is in UTF-8 character encoding
val charset: Charset = Charset.forName("ISO-8859-1")
val cellData = cell.padTo(50, ' ')           // pad to 50 characters with spaces

val isoData = new String(cellData.getBytes(charset), charset) // here it converts the string from UTF-8 to ISO-8859-1

But when I save the file using saveAsNewAPIHadoopFile, the output comes out as UTF-8. I think the TextOutputFormat used by saveAsNewAPIHadoopFile automatically converts the output to UTF-8. Is there a way to save the content to the S3 bucket in the original encoding (ISO-8859-1)?

import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

ds.rdd.map { record =>                       // record is assumed to be a row of cell strings
  record.map { cell =>
    val cellData = cell.padTo(50, ' ')       // pad each cell to 50 characters
    new String(cellData.getBytes("ISO-8859-1"), "ISO-8859-1")
  }.reduce { _ + _ }                         // concatenate the padded cells into one fixed-width line
}.mapPartitions { iter =>
  val text = new Text()
  iter.map { item =>
    text.set(item)
    (NullWritable.get(), text)
  }
}.saveAsNewAPIHadoopFile("s3://mybucket/", classOf[NullWritable], classOf[Text],
  classOf[TextOutputFormat[NullWritable, Text]])
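
One thing worth noting: Text itself stores UTF-8, so text.set(item) re-encodes the Java string as UTF-8 bytes, and TextOutputFormat then writes a Text value's byte array as-is. The UTF-8 conversion therefore happens at text.set, not when the file lands in S3. Below is a minimal, untested sketch of one possible way around this: encode to ISO-8859-1 bytes yourself and hand the raw bytes to Text. Here lines stands for the RDD[String] of fixed-width lines built by the first map above, and the bucket path is a placeholder.

import java.nio.charset.StandardCharsets
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

val pairs = lines.mapPartitions { iter =>
  val text = new Text()
  iter.map { line =>
    // put the ISO-8859-1 bytes straight into the Text object instead of
    // calling text.set(line), which would re-encode the string as UTF-8
    text.set(line.getBytes(StandardCharsets.ISO_8859_1))
    (NullWritable.get(), text)
  }
}

pairs.saveAsNewAPIHadoopFile(
  "s3://mybucket/output-iso",                // placeholder path
  classOf[NullWritable],
  classOf[Text],
  classOf[TextOutputFormat[NullWritable, Text]])

Since the newline separator that TextOutputFormat appends is the same byte (0x0A) in both UTF-8 and ISO-8859-1, the output should stay valid ISO-8859-1 as long as every character in the records has an ISO-8859-1 mapping.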

Appreciate your help

1 answer:

Answer 0: (score: 0)

I still haven't found the correct answer, but as a workaround I copy the file to HDFS, convert it to ISO format with iconv, and save it back to the S3 bucket. It does the job for me, but it needs two extra steps in the EMR cluster. I thought this might be useful for anyone who runs into the same issue.
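
To make that workaround a little more concrete, the commands below are only an illustrative sketch of such a two-step conversion; the exact commands are not given in the answer, and all paths and bucket names here are placeholders:

# copy one output part file from S3 down to local disk on an EMR node
hadoop fs -copyToLocal s3://mybucket/output/part-r-00000 part-utf8.txt

# re-encode it from UTF-8 to ISO-8859-1
iconv -f UTF-8 -t ISO-8859-1 part-utf8.txt > part-iso.txt

# upload the converted file back to the S3 bucket
aws s3 cp part-iso.txt s3://mybucket/output-iso/part-r-00000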