Question

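I have some Avro classes that I generated, and I am now trying to use them in Spark. I imported my Avro-generated Java class "twitter_schema" and reference it when deserializing. It seems to work, but I end up with a cast exception.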

My schema:

$ more twitter.avsc

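{
  "type" : "record",
  "name" : "twitter_schema",
  "namespace" : "com.miguno.avro",
  "fields" : [ {
    "name" : "username",
    "type" : "string",
    "doc" : "Name of the user account on Twitter.com"
  }, {
    "name" : "tweet",
    "type" : "string",
    "doc" : "The content of the user's Twitter message"
  }, {
    "name" : "timestamp",
    "type" : "long",
    "doc" : "Unix epoch time in seconds"
  } ],
  "doc:" : "A basic schema for storing Twitter messages"
}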

My code:

import java.nio.ByteBuffer
import org.apache.hadoop.conf.Configuration
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.avro.mapred.AvroKey
import org.apache.hadoop.io.NullWritable
import org.apache.avro.mapred.AvroInputFormat
import org.apache.avro.mapred.AvroWrapper
import org.apache.avro.file.DataFileReader
import org.apache.avro.file.DataFileWriter
import org.apache.avro.io.DatumReader
import org.apache.avro.io.DatumWriter
import org.apache.avro.specific.SpecificDatumReader
import com.miguno.avro.twitter_schema

val path = "/app/avro/data/twitter.avro"
val conf = new Configuration
var avroRDD = sc.newAPIHadoopFile(path,classOf[AvroKeyInputFormat[twitter_schema]], 
classOf[AvroKey[ByteBuffer]], classOf[NullWritable], conf)
var avroRDD = sc.hadoopFile(path,classOf[AvroInputFormat[twitter_schema]], 
classOf[AvroWrapper[twitter_schema]], classOf[NullWritable], 5)

avroRDD.map(l => { 
      //transformations here
      new String(l._1.datum.username)
}
).first

I get an error on the last line:

scala> avroRDD.map(l => { 
     |       new String(l._1.datum.username)}).first
<console>:30: error: overloaded method constructor String with alternatives:
  (x$1: StringBuilder)String <and>
  (x$1: StringBuffer)String <and>
  (x$1: Array[Byte])String <and>
  (x$1: Array[Char])String <and>
  (x$1: String)String
 cannot be applied to (CharSequence)
                    new String(l._1.datum.username)}).first

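What am I doing wrong, or what am I not understanding about the error? Is this the right way to deserialize? I read about Kryo, but it seems to add complexity. I also read that the Spark SQL context accepts Avro as of 1.2, but that sounds like a performance hit or a workaround. What is the best practice here?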

Thanks, Matt

Answer 1

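I think your problem is that Avro has deserialized the string into a CharSequence, but Spark expected a Java String. Avro has three ways to deserialize strings in Java: into CharSequence, into String, and into Utf8 (an Avro class for storing strings, somewhat like Hadoop's Text).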

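You can control this by adding the "avro.java.string" property to your Avro schema. The possible values are (case-sensitive): "String", "CharSequence", "Utf8". There may be a way to control it dynamically through the input format as well, but I don't know exactly.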
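As a rough sketch of what that property looks like in practice, the snippet below annotates only the username field of the question's schema and reads the property back with Avro's schema parser. The object name StringTypeSchema and the helper stringTypeOf are hypothetical, and the field name "tweet" is assumed from the schema in the question. It also assumes Avro is on the classpath. Whether the generated twitter_schema class must be regenerated for the change to take effect is not covered here.

```scala
import org.apache.avro.Schema

// Hypothetical helper object, for illustration only.
object StringTypeSchema {
  // The question's schema with "avro.java.string" added to the username field.
  val schemaJson: String =
    """{
      |  "type": "record",
      |  "name": "twitter_schema",
      |  "namespace": "com.miguno.avro",
      |  "fields": [
      |    {"name": "username", "type": {"type": "string", "avro.java.string": "String"}},
      |    {"name": "tweet", "type": "string"},
      |    {"name": "timestamp", "type": "long"}
      |  ]
      |}""".stripMargin

  val schema: Schema = new Schema.Parser().parse(schemaJson)

  // Returns the "avro.java.string" property of a field's schema, or null if it is unset.
  def stringTypeOf(fieldName: String): String =
    schema.getField(fieldName).schema().getProp("avro.java.string")
}
```

Calling StringTypeSchema.stringTypeOf("username") reads the annotation from the parsed schema, while an unannotated field such as "tweet" has no such property.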

Answer 2

OK, since CharSequence is an interface that String implements, I can keep my Avro schema as it is and simply turn my Avro string into a String via toString(), i.e.:

scala> avroRDD.map(l => {
     | new String(l._1.datum.get("username").toString())
     | } ).first
res2: String = miguno
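The same conversion can be tried without Spark or Avro. The sketch below uses a java.lang.StringBuilder as a stand-in for an Avro string field, since it is also a CharSequence that is not a java.lang.String. The object name CharSequenceDemo and the helper asJavaString are hypothetical.

```scala
// Hypothetical helper object, for illustration only.
object CharSequenceDemo {
  // java.lang.String has no constructor taking a CharSequence, so
  // `new String(cs)` does not compile. toString works for any CharSequence.
  def asJavaString(cs: CharSequence): String = cs.toString

  def main(args: Array[String]): Unit = {
    // Stand-in for an Avro string field such as l._1.datum.username.
    val username: CharSequence = new java.lang.StringBuilder("miguno")
    println(asJavaString(username))
  }
}
```

Because toString already returns a String, wrapping the result in another new String(...) is not needed.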

Trying to deserialize Avro in Spark with a specific type
