Flink consuming S3 Parquet files: Kryo serialization error

Time: 2019-01-21 14:03:13

Tags: protocol-buffers apache-flink parquet

We want to consume Parquet files stored in S3.

My code snippet is below. The input files are protobuf-encoded Parquet files, and the protobuf class is Pageview.class.

import com.adshonor.proto.StandardLog;
import com.twitter.chill.protobuf.ProtobufSerializer;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.scala.hadoop.mapreduce.HadoopInputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.proto.ProtoParquetInputFormat;
import scala.Tuple2;

public class ParquetReadJob {
    public static void main(String... args) throws Exception {
        ExecutionEnvironment ee = ExecutionEnvironment.getExecutionEnvironment();
        // Ask Kryo to use chill's protobuf serializer for the Pageview message type.
        ee.getConfig().registerTypeWithKryoSerializer(StandardLog.Pageview.class, ProtobufSerializer.class);
        // The input path (for example an S3 location) is the first program argument.
        String path = args[0];

        Job job = Job.getInstance();
        job.setInputFormatClass(ProtoParquetInputFormat.class);

        // Wrap the Hadoop Parquet/protobuf input format so that Flink can read from it.
        HadoopInputFormat<Void, StandardLog.Pageview> hadoopIF =
                new HadoopInputFormat<>(new ProtoParquetInputFormat<>(), Void.class, StandardLog.Pageview.class, job);

        ProtoParquetInputFormat.addInputPath(job, new Path(path));
        DataSource<Tuple2<Void, StandardLog.Pageview>> dataSet = ee.createInput(hadoopIF).setParallelism(10);

        dataSet.print();
    }
}

It always fails with this error:

com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException
Serialization trace:
supportCrtSize_ (access.Access$AdPositionInfo)
adPositionInfo_ (access.Access$AccessRequest)
accessRequest_ (com.adshonor.proto.StandardLog$Pageview$Builder)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
    at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:730)
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:22)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
    at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:730)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
    at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
    at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
    at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:315)
    at org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)
    at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106)
    at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:72)
    at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47)
    at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)
    at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:216)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException
    at java.util.Collections$UnmodifiableCollection.add(Collections.java:1055)
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
    at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:22)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
    ... 23 more

Can anyone suggest how to write a batch program that can consume files like these?

1 answer:

Answer 0 (score: 0)

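I ran into this problem too. I found this and this in a pending PR for flink-protobuf, and they solved it for me.

You need to add the NonLazyProtobufSerializer and ProtobufKryoSerializer classes to your project, then register NonLazyProtobufSerializer as the default Kryo serializer for the Message type: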

env.getConfig().addDefaultKryoSerializer(Message.class, NonLazyProtobufSerializer.class);

From the author's JavaDocs:

  

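This is a workaround for an issue that surfaces when consuming a data source from Kafka in a Flink TableEnvironment. For fields declared with type 'string' in .proto, the corresponding field in the Java class has declared type 'Object'. The actual type of these fields on objects returned by Message.parseFrom(byte[]) is 'ByteArray'. But the getter methods for these fields return 'String', lazily replacing the underlying ByteArray field with a String when necessary.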
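As a side note on the trace in the question: it shows a StandardLog$Pageview$Builder being read field by field and ends in Collections$UnmodifiableCollection.add, called from Kryo's CollectionSerializer.read. In other words, the deserializer tries to refill a collection by calling add() on it, and an unmodifiable collection rejects that. The sketch below reproduces only that failure mode with plain JDK classes. It does not use Kryo or protobuf, and the class and method names (UnmodifiableAddDemo, refill) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class: JDK only, no Kryo or protobuf involved.
class UnmodifiableAddDemo {

    // Mimics a generic collection deserializer: it refills the target
    // collection by calling add() once per element read back.
    // Returns false if the target rejects add().
    public static boolean refill(Collection<Integer> target, List<Integer> elements) {
        try {
            for (Integer element : elements) {
                target.add(element);
            }
            return true;
        } catch (UnsupportedOperationException e) {
            // Same exception type as the "Caused by" line in the trace.
            return false;
        }
    }

    public static void main(String[] args) {
        List<Integer> elements = Arrays.asList(300, 250);

        // A plain ArrayList accepts add(), so refilling works.
        Collection<Integer> mutable = new ArrayList<>();
        System.out.println("mutable target refilled: " + refill(mutable, elements));

        // An unmodifiable view throws UnsupportedOperationException on add().
        Collection<Integer> readOnly = Collections.unmodifiableList(new ArrayList<Integer>());
        System.out.println("unmodifiable target refilled: " + refill(readOnly, elements));
    }
}
```

This is why a serializer that goes through the protobuf wire format, such as the one registered above, avoids the problem: it never has to add elements to the message's internal collections one by one.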

Hope this helps.