Items not grouped properly - CoGroupByKey

Time: 2016-05-27 02:06:09

Tags: apache-flink apache-beam

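CoGroupByKey problem

Data description

I have two datasets.

  • Records - the first one; it contains about 0.5-1M records per (key, day). For testing I use 2-3 keys and 5-10 days of data. My target is 1000+ keys. Each record contains a key, a timestamp in microseconds, and some other data.
  • Configs - the second one, rather small. It describes a key over time; you can think of it as a list of tuples: (key, start date, end date, description).

For exploration I have encoded the data as files of length-prefixed, binary-encoded Protocol Buffers messages. The files are also gzipped. The data is sharded by date, and each file is about 10 MB.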
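A minimal sketch of reading one such file, assuming the length prefix is a base-128 varint like the one protobuf's writeDelimitedTo produces (the question does not say which framing is actually used). DelimitedGzip and its method names are hypothetical; payloads come back as raw bytes that would then be handed to the generated message parser. It needs Java 11 or newer for InputStream.readNBytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Sketch: gzip stream of varint-length-prefixed payloads (assumed framing). */
final class DelimitedGzip {

    /** Reads every frame from a gzip-compressed stream of length-prefixed payloads. */
    static List<byte[]> readFrames(InputStream compressed) {
        List<byte[]> frames = new ArrayList<>();
        try (InputStream in = new GZIPInputStream(compressed)) {
            int first;
            while ((first = in.read()) != -1) {
                int len = readVarint(in, first);
                byte[] payload = in.readNBytes(len);
                if (payload.length != len) {
                    throw new EOFException("truncated frame");
                }
                frames.add(payload);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return frames;
    }

    /** Decodes a base-128 varint whose first byte has already been read. */
    private static int readVarint(InputStream in, int firstByte) throws IOException {
        int result = firstByte & 0x7F;
        int shift = 7;
        int b = firstByte;
        while ((b & 0x80) != 0) {
            b = in.read();
            if (b == -1) throw new EOFException("truncated length prefix");
            if (shift > 28) throw new IOException("length prefix too long");
            result |= (b & 0x7F) << shift;
            shift += 7;
        }
        if (result < 0) throw new IOException("negative frame length");
        return result;
    }

    /** Writes one payload preceded by its length as a base-128 varint. */
    static void writeFrame(OutputStream out, byte[] payload) throws IOException {
        int len = payload.length;
        while ((len & ~0x7F) != 0) {
            out.write((len & 0x7F) | 0x80);
            len >>>= 7;
        }
        out.write(len);
        out.write(payload);
    }

    /** Helper for experiments: gzips a list of payloads into one byte array. */
    static byte[] gzipFrames(List<byte[]> payloads) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(buf)) {
            for (byte[] p : payloads) writeFrame(out, p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }
}
```

Reading a shard would then be something like DelimitedGzip.readFrames(new FileInputStream(path)), with each returned byte array parsed into a record.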

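Pipeline

I use Apache Beam to express the pipeline.

  1. First, I add keys to both datasets. For the Records dataset the key is (key, day rounded timestamp). For Configs the key is (key, day), where day is each timestamp value between start date and end date (pointing to midnight).
  2. The datasets are merged using CoGroupByKey.
  3. As the key type I use org.apache.flink.api.java.tuple.Tuple2 with a Tuple2Coder from the repo github.com/orian/tuple-coder.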
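The keying in step 1 can be sketched in plain Java as follows. This is only an illustration under assumptions: days are taken to start at midnight UTC, the end date is treated as inclusive, and DayKeys, floorToDay, and daysBetween are hypothetical names that do not come from the question.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the day part of the (key, day) grouping key. */
final class DayKeys {
    static final long USEC_PER_DAY = 24L * 60 * 60 * 1_000_000;

    /** Rounds a microsecond timestamp down to midnight (UTC assumed) of its day. */
    static long floorToDay(long timestampUsec) {
        return Math.floorDiv(timestampUsec, USEC_PER_DAY) * USEC_PER_DAY;
    }

    /**
     * Lists every midnight from the day of startUsec up to endUsec.
     * Treating the end as inclusive is an assumption.
     */
    static List<Long> daysBetween(long startUsec, long endUsec) {
        List<Long> days = new ArrayList<>();
        for (long d = floorToDay(startUsec); d <= endUsec; d += USEC_PER_DAY) {
            days.add(d);
        }
        return days;
    }
}
```

A Record would be keyed by (key, floorToDay(timestamp)), and a Config would be emitted once for each value in daysBetween(start, end), so that both sides meet on the same (key, day) pair.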

    Problem

    If the Records dataset is small, for example 5 days, everything seems fine (see normal_run.log).

    INFO [main] (FlinkPipelineRunner.java:124) - Final aggregator values:
    INFO [main] (FlinkPipelineRunner.java:127) - item count : 4322332
    INFO [main] (FlinkPipelineRunner.java:127) - missing val1 : 0
    INFO [main] (FlinkPipelineRunner.java:127) - multiple val1 : 0
    

    When I run the pipeline on 10 or more days of data, I get an error stating that some records have no config (wrong_run.log).

    INFO [main] (FlinkPipelineRunner.java:124) - Final aggregator values:
    INFO [main] (FlinkPipelineRunner.java:127) - item count : 8577197
    INFO [main] (FlinkPipelineRunner.java:127) - missing val1 : 6
    INFO [main] (FlinkPipelineRunner.java:127) - multiple val1 : 0
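For comparison, the result the pipeline is expected to produce can be stated as a small in-memory reference. The sketch below is not Beam code. It assumes that "missing val1" counts groups that have records but no config and that "multiple val1" counts groups with more than one config, which the question does not confirm. ReferenceCoGroup and its members are hypothetical names, and keys are plain strings so that equality is by value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory reference for co-grouping two keyed inputs. */
final class ReferenceCoGroup {

    /** Everything that shares one grouping key, for example "KeyValue3@1462665600000000". */
    static final class Group {
        final List<String> records = new ArrayList<>();
        final List<String> configs = new ArrayList<>();
    }

    /** Puts each record and each config into the single group of its key. */
    static Map<String, Group> coGroup(List<Map.Entry<String, String>> records,
                                      List<Map.Entry<String, String>> configs) {
        Map<String, Group> groups = new HashMap<>();
        for (Map.Entry<String, String> r : records) {
            groups.computeIfAbsent(r.getKey(), k -> new Group()).records.add(r.getValue());
        }
        for (Map.Entry<String, String> c : configs) {
            groups.computeIfAbsent(c.getKey(), k -> new Group()).configs.add(c.getValue());
        }
        return groups;
    }

    /** Groups that have records but no config (assumed meaning of "missing val1"). */
    static long countMissing(Map<String, Group> groups) {
        return groups.values().stream()
                .filter(g -> !g.records.isEmpty() && g.configs.isEmpty())
                .count();
    }

    /** Groups with more than one config (assumed meaning of "multiple val1"). */
    static long countMultiple(Map<String, Group> groups) {
        return groups.values().stream()
                .filter(g -> g.configs.size() > 1)
                .count();
    }
}
```

In this reference every key appears in exactly one group, so a (key, day) pair that has a config can never be counted as missing. Running it on a small sample of the real data gives a baseline to compare the aggregator values against.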
    

    Then I added some extra log messages:

    (a.java:144) - 68643 items for KeyValue3 on: 1462665600000000
    (a.java:140) - no items for KeyValue3 on: 1463184000000000
    (a.java:123) - missing for KeyValue3 on: 1462924800000000
    (a.java:142) - 753707 items for KeyValue3 on: 1462924800000000 marked as no-loc
    (a.java:123) - missing for KeyValue3 on: 1462752000000000
    (a.java:142) - 749901 items for KeyValue3 on: 1462752000000000 marked as no-loc
    (a.java:144) - 754578 items for KeyValue3 on: 1462406400000000
    (a.java:144) - 751574 items for KeyValue3 on: 1463011200000000
    (a.java:123) - missing for KeyValue3 on: 1462665600000000
    (a.java:142) - 754758 items for KeyValue3 on: 1462665600000000 marked as no-loc
    (a.java:123) - missing for KeyValue3 on: 1463184000000000
    (a.java:142) - 694372 items for KeyValue3 on: 1463184000000000 marked as no-loc
    

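    Counting the lines of the log excerpt above, the first line shows that 68643 items were processed for KeyValue3 and time 1462665600000000.
    Later, in line 9, the operation appears to process the same key again, but it reports that no Config was available for these Records. Line 10 reports that they were marked as no-loc.

    Line 2 says there were no items for KeyValue3 and time 1463184000000000, but lines 11 and 12 show that the items for this (key, day) pair were processed later, and that they lacked a Config.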
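The same check can be done mechanically on a full log. The sketch below counts how often each (key, day) pair starts a group and keeps only pairs seen more than once. It assumes that the messages "N items for", "no items for", and "missing for" each mark one group being processed, and that the "marked as no-loc" line belongs to the preceding "missing" line; GroupEmissions and repeatedGroups are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Finds (key, day) pairs that were reported as a group more than once. */
final class GroupEmissions {

    // Matches lines that end right after the timestamp, which skips the
    // "... marked as no-loc" follow-up lines.
    private static final Pattern GROUP_START = Pattern.compile(
            "- (?:\\d+ items|no items|missing) for (\\S+) on: (\\d+)$");

    /** Returns "key@day" mapped to its number of group reports, for counts of 2 or more. */
    static Map<String, Integer> repeatedGroups(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = GROUP_START.matcher(line.trim());
            if (m.find()) {
                counts.merge(m.group(1) + "@" + m.group(2), 1, Integer::sum);
            }
        }
        // Keep only the pairs that were grouped more than once.
        counts.values().removeIf(n -> n < 2);
        return counts;
    }
}
```

Applied to the twelve lines above, this flags KeyValue3 on 1462665600000000 and on 1463184000000000, the two pairs discussed in the text. With correct grouping the returned map would be empty.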

    Some clues

    During one of the exploration runs I got an exception (exception_thrown.log).

    05/26/2016 03:49:49 GroupReduce (GroupReduce at GroupByKey)(1/5) switched to FAILED
    java.lang.Exception: The data preparation for task 'GroupReduce (GroupReduce at GroupByKey)' , caused an error: Error obtaining the sorted input: Thread 'SortMerger spilling thread' terminated due to an exception: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: tried to access field com.esotericsoftware.kryo.io.Input.inputStream from class org.apache.flink.api.java.typeutils.runtime.NoFetchingInput
      at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:455)
      at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
      at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
      at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger spilling thread' terminated due to an exception: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: tried to access field com.esotericsoftware.kryo.io.Input.inputStream from class org.apache.flink.api.java.typeutils.runtime.NoFetchingInput
      at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
      at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1079)
      at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:94)
      at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:450)
      ... 3 more
    Caused by: java.io.IOException: Thread 'SortMerger spilling thread' terminated due to an exception: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: tried to access field com.esotericsoftware.kryo.io.Input.inputStream from class org.apache.flink.api.java.typeutils.runtime.NoFetchingInput
      at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:799)
    Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: tried to access field com.esotericsoftware.kryo.io.Input.inputStream from class org.apache.flink.api.java.typeutils.runtime.NoFetchingInput
      at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
      at org.apache.flink.runtime.operators.sort.LargeRecordHandler.finishWriteAndSortKeys(LargeRecordHandler.java:263)
      at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$SpillingThread.go(UnilateralSortMerger.java:1409)
      at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:796)
    Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: tried to access field com.esotericsoftware.kryo.io.Input.inputStream from class org.apache.flink.api.java.typeutils.runtime.NoFetchingInput
      at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:799)
    Caused by: java.lang.IllegalAccessError: tried to access field com.esotericsoftware.kryo.io.Input.inputStream from class org.apache.flink.api.java.typeutils.runtime.NoFetchingInput
      at org.apache.flink.api.java.typeutils.runtime.NoFetchingInput.readBytes(NoFetchingInput.java:122)
      at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:297)
      at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
      at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
      at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:706)
      at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
      at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
      at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
      at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:228)
      at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:242)
      at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:144)
      at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30)
      at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:144)
      at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30)
      at org.apache.flink.runtime.io.disk.InputViewIterator.next(InputViewIterator.java:43)
      at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ReadingThread.go(UnilateralSortMerger.java:973)
      at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:796)
    

    Workaround (after more testing it does not work, so I continue to use Tuple2)

    I switched from Tuple2 to a Protocol Buffers message:

    message KeyDay {
      optional bytes key = 1;
      optional int64 timestamp_usec = 2;
    }
    

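    But using Tuple2.of() is simply easier than KeyDay.newBuilder().setKey(...).setTimestampUsec(...).build().

    After switching to a key class derived from protobuf.Message, the problem disappeared for 10-15 days of data (the data size that was a problem for Tuple2), but increasing the data size to 20 days showed that it is still there.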

0 answers:

No answers