Question

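We have a Cloud Dataflow job that takes in a BigQuery table, transforms it, and then writes each record to a different table depending on the month/year in that record's timestamp. So when we run the job on a table with 12 months of data, there should be 12 output tables. The first month is the main output and the other 11 months are side outputs.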

We have found that when we run the job with 10 or more months of data (9 side outputs), the job fails.

Is this a limitation of Cloud Dataflow, or is it a bug?

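I noticed in the execution graph that when the job runs with more than 8 side outputs, some of the outputs say "running" but they do not appear to be writing any records.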

Here are some of our job IDs:

2015-06-14_23_58_06-14457541029573485807 (8 side outputs - passed)

2015-06-14_23_48_43-15277609445992188388 (9 side outputs - failed)

2015-06-14_23_11_46-10500077558949649888 (7 side outputs - passed)

2015-06-14_22_38_48-1428211312699949403 (3 side outputs - passed)

2015-06-14_21_44_27-16273252623089185131 (11 side outputs - failed)

Here is the code that processes the data. There is no caching involved. (TressOutputManager only holds a cache of the TupleTag<TableRow> objects.)

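import java.util.Set;
import javax.inject.Inject;
import javax.inject.Named;
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.transforms.DoFn;

// CPTMapper and TressOutputManager are our own classes (not shown).
public class TressDenormalizationDoFn extends DoFn<TableRow, TableRow> {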
    @Inject
    @Named("tress.mappers")
    private Set<CPTMapper> mappers;
    @Inject
    private TressOutputManager tuples;

    @Override
    public void processElement(ProcessContext c) throws Exception {
        TableRow row = c.element().clone();
        for (CPTMapper mapper : mappers) {
            String mapped = mapper.map((String) row.get("event"));
            if (mapped != null) {
                row.set(mapper.getId(), mapped);
            }
        }
        // Route the record to the output for its month, keyed by the first
        // seven characters of the timestamp with "-" replaced by "_".
        String timeStamp = (String) row.get("time_local");
        if (timeStamp != null) {
            timeStamp = timeStamp.substring(0, 7).replaceAll("-", "_");

            if (tuples.isMainOutput(timeStamp)) {
                c.output(row);
            } else {
                c.sideOutput(tuples.getTuple(timeStamp), row);
            }
        }
    }
}
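
To make the routing step easier to reason about outside of Dataflow, here is a minimal sketch of just the month-key logic from the DoFn above, with no SDK dependency. It assumes that time_local starts with a "YYYY-MM" prefix (for example "2015-06-14 23:58:06"); the class name MonthRoutingSketch and the helpers monthKey and route are hypothetical names used only for illustration and are not part of our job. The length check is also an addition that the DoFn above does not have.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the actual Dataflow job.
public class MonthRoutingSketch {

    // Derives the month key the same way the DoFn does:
    // first seven characters, with '-' replaced by '_'.
    // Assumption: the timestamp begins with "YYYY-MM".
    // Returns null when the timestamp is missing or too short.
    static String monthKey(String timeLocal) {
        if (timeLocal == null || timeLocal.length() < 7) {
            return null;
        }
        return timeLocal.substring(0, 7).replace('-', '_');
    }

    // Groups timestamps by month key, preserving first-seen order.
    // Each key corresponds to one output table.
    static Map<String, List<String>> route(List<String> timestamps) {
        Map<String, List<String>> outputs = new LinkedHashMap<>();
        for (String ts : timestamps) {
            String key = monthKey(ts);
            if (key != null) {
                outputs.computeIfAbsent(key, k -> new ArrayList<>()).add(ts);
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "2014-07-01 10:00:00",
                "2014-08-15 12:30:00",
                "2014-07-20 09:45:00");
        Map<String, List<String>> outputs = route(sample);
        // One key stands in for the main output; the rest would be side outputs.
        System.out.println("keys=" + outputs.keySet()
                + ", sideOutputs=" + (outputs.size() - 1));
    }
}
```

With 12 distinct months in the input, route would produce 12 keys, which matches the 1 main output plus 11 side outputs described above.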

Is there a limit on the number of side outputs in Google Cloud Dataflow?

0 answers: