I am trying to read data from a Kafka topic, serialized in Avro format, in a Spark Streaming application.
While converting the byte[] to a GenericRecord I get an exception. I tried printing the length of the byte array and it shows 957.
When I convert the byte[] to a String, I can see the record, so I am not sure why I get a malformed-data exception. I did notice the record contains some Latin characters.
I have gone through quite a few posts, but I have not found a suitable solution.
I was serializing the data in Avro format using the Twitter Bijection API.
There are posts that suggest using DatumReader and DatumWriter, but that did not work for me either.
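Roughly, this is how Bijection is being used on the producer side (a simplified, self-contained sketch against a cut-down schema; the real schema, field values, and topic plumbing differ):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import com.twitter.bijection.Injection;
import com.twitter.bijection.avro.GenericAvroCodecs;

public class BijectionRoundTrip {
    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(
            "{\"namespace\":\"example.avro\",\"type\":\"record\",\"name\":\"baseschema\","
          + "\"fields\":[{\"name\":\"srcsystemcd\",\"type\":\"string\"}]}");

        // Producer side: GenericRecord -> byte[] via Bijection's binary Avro codec
        Injection<GenericRecord, byte[]> injection = GenericAvroCodecs.toBinary(schema);
        GenericRecord record = new GenericData.Record(schema);
        record.put("srcsystemcd", "CRT");
        byte[] bytes = injection.apply(record);

        // Consumer side: byte[] -> GenericRecord using the same Injection
        GenericRecord decoded = injection.invert(bytes).get();
        System.out.println(decoded.get("srcsystemcd"));
    }
}

This is the full exception from the Spark job: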
[ERROR] 07-13-2018 08:35:41,793 com.example.DataTransformationStarter main 154- Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -62
at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:363)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:355)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:157)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
at com.example.ProcessorFactory.getSourceType(ProcessorFactory.java:117)
at com.example.ProcessorFactory.getProcessor(ProcessorFactory.java:48)
at com.example.processRecord(RddMicroBatchProcessor.java:166)
at com.example.lambda$processEachBatch$65712684$1(RddMicroBatchProcessor.java:64)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I am using the same schema for both serialization and deserialization.
private static SourceType getSourceType(byte[] scrmRecord) {
    Schema.Parser parser = new Schema.Parser();
    // Schema schema = parser.parse(MDMCommonUtils.getBaseAvroSchema());
    Schema schema = parser.parse(MDMCommonUtils.getCRTAvroSchema());

    // DatumReader<GenericRecord> reader = new SpecificDatumReader<GenericRecord>(schema);
    // Decoder decoder = DecoderFactory.get().binaryDecoder(scrmRecord, null);
    // GenericRecord record = null;
    // try {
    //     record = reader.read(null, decoder);
    // } catch (IOException e) {
    //     e.printStackTrace();
    // }
    // return SourceType.fromString(record.get("srcsystemcd") != null ? record.get("srcsystemcd").toString() : "");

    DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
    BinaryDecoder decoder1 = DecoderFactory.get().binaryDecoder(scrmRecord, null);
    GenericRecord record1 = null;
    try {
        record1 = datumReader.read(null, decoder1);
    } catch (IOException e) {
        e.printStackTrace();
    }
    String name = ((Utf8) record1.get("name")).toString();
    return SourceType.fromString(record1.get("srcsystemcd") != null ? ((Utf8) record1.get("name")).toString() : "");
}
Schema:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "baseschema",
  "fields": [
    {"name": "srcsystemcd", "type": "string"}
  ]
}
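For reference, this is the kind of write/read round trip I would expect to work against this schema (a minimal standalone sketch; the field value is made up). The GenericDatumReader call in getSourceType can only succeed if the bytes on the topic were produced by the matching writer/encoder pair (or an equivalent binary encoder such as Bijection's):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroRoundTrip {
    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(
            "{\"namespace\":\"example.avro\",\"type\":\"record\",\"name\":\"baseschema\","
          + "\"fields\":[{\"name\":\"srcsystemcd\",\"type\":\"string\"}]}");

        // Write side: GenericRecord -> byte[] with a GenericDatumWriter and a binary encoder
        GenericRecord out = new GenericData.Record(schema);
        out.put("srcsystemcd", "CRT");
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(baos, null);
        new GenericDatumWriter<GenericRecord>(schema).write(out, encoder);
        encoder.flush();
        byte[] bytes = baos.toByteArray();

        // Read side: byte[] -> GenericRecord, exactly as in getSourceType
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        GenericRecord in = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(in.get("srcsystemcd")); // prints CRT
    }
}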
The source record contains some Latin characters:
{"srcsystemcd":"T12","srcupdatedt":"2011-02-27 10:01:40.0","pkeysrcobject":"1234567","modeind":"test1","srcid":"CRT","svtid":"","srcstatuscd":"active","geocd":"ABCD","regioncd":"ABDCE","channelid":"NA","fl":"N","roletypeid":"12","hqfl":"Y","websiteurl":"","emailaddr":"","phonenum":"","faxnum":"","matchnm":"Základnín","dbanm":"","legalnm":"Základnín","deptnm":"","addr1txt":"abcde 11","addr2txt":"","storenum":"","citynm":"abcde","districtnm":"","countynm":"","stateprovincecd":"","stateprovincenm":"","postalcd":"1234","countryiso2cd":"CZ","countryiso3cd":"CZE","altdbanm":"","altlegalnm":"","altdeptnm":"","altaddr1txt":"","altaddr2txt":"","altstorenum":"","altcitynm":"","altdistrictnm":"","altcountynm":"","altstateprovincecd":"","altstateprovincenm":"","altpostalcd":"","altcountryiso2cd":"","altcountryiso3cd":"","altlangcd":"","recdeletefl":"N"}
Could these exceptions be caused by the Latin characters being present? Any pointers or help would be greatly appreciated.
Answer 0 (score 0):
I think the problem here has nothing to do with Latin characters. The problem is that the Serializer and Deserializer being used do not match.
The data pushed onto the topic was serialized as a String, but deserialized with a ByteArrayDeserializer. That mismatch is what causes the problem: round-tripping arbitrary Avro binary through a String is lossy, and a corrupted byte stream is a plausible reason the decoder reads a negative length (-62) where it expects a field-length prefix.
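A minimal sketch of what to check, assuming the Avro bytes are sent as byte[] (broker address, group id and the key serde choices here are placeholders): the producer's value serializer and the consumer's value deserializer must form a matching pair. In Spark Streaming the consumer settings usually go into the kafkaParams map for the direct stream; plain kafka-clients Properties are used here just to show the pairing.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class SerdePairingCheck {
    public static void main(String[] args) {
        // Producer side: Avro binary must go out as byte[], not as a String.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        // Consumer side: the value deserializer must mirror the producer's value serializer.
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "avro-consumer-group");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
    }
}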