Apache Beam Metrics Counter giving incorrect count using SparkRunner

Date: 2019-11-25 13:16:29

Tags: apache-spark apache-beam

I have source and target CSV files, each with 10 million records and 250 columns. I am running an Apache Beam pipeline that joins all columns of the source and target files. When I run this pipeline on a Spark cluster, it executes without exceptions, but the Beam metrics counter for the join returns double the expected count when the Spark executor memory is set to 2g (--executor-memory "2g"). When I increase the executor memory to 11g, it returns the correct count.

I tried the following example:

    Pipeline pipeline = Pipeline.create(options);
    // Anonymous subclasses capture the element type so Beam can infer the coder.
    final TupleTag<String> eventInfoTag = new TupleTag<String>() {};
    final TupleTag<String> countryInfoTag = new TupleTag<String>() {};

    PCollection<KV<String, String>> eventInfo =
        eventsTable.apply(ParDo.of(new ExtractEventDataFn()));
    PCollection<KV<String, String>> countryInfo =
        countryCodes.apply(ParDo.of(new ExtractCountryInfoFn()));

    // Join the two collections on their keys.
    PCollection<KV<String, CoGbkResult>> kvpCollection =
        KeyedPCollectionTuple.of(eventInfoTag, eventInfo)
            .and(countryInfoTag, countryInfo)
            .apply(CoGroupByKey.create());

    PCollection<KV<String, String>> finalResultCollection =
        kvpCollection.apply(
            "Process",
            ParDo.of(
                new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    KV<String, CoGbkResult> e = c.element();
                    String countryCode = e.getKey();
                    // Fall back to "none" when no country record joined for this key.
                    String countryName = e.getValue().getOnly(countryInfoTag, "none");
                    for (String eventInfo : e.getValue().getAll(eventInfoTag)) {
                      Metrics.counter("count", "errorcount").inc();
                      c.output(
                          KV.of(
                              countryCode,
                              "Country name: " + countryName + ", Event info: " + eventInfo));
                    }
                  }
                }));

    final PipelineResult result = pipeline.run();
    MetricQueryResults metrics =
        result
            .metrics()
            .queryMetrics(
                MetricsFilter.builder()
                    .addNameFilter(MetricNameFilter.inNamespace("count"))
                    .build());
    Iterable<MetricResult<Long>> counters = metrics.getCounters();
    for (MetricResult<Long> counter : counters) {
      System.out.println(
          "Hi  >> " + counter.getName().getName() + " : "
              + counter.getAttempted() + " " + counter.getCommittedOrNull());
    }

I need help with this. Thank you.

2 Answers:

Answer 0 (score: 0)

public static void main(String[] args) {

        // Point Beam's HadoopFileSystem at the cluster and register the FS implementations.
        Configuration hadoopConf = new Configuration();
        hadoopConf.set("fs.defaultFS", args[13]);
        hadoopConf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        hadoopConf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
        final TupleTag<Row> sourceDataInfoTag = new TupleTag<Row>(){};
        final TupleTag<Row> targetDataInfoTag = new TupleTag<Row>(){};
        HadoopFileSystemOptions options = PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
        options.setRunner(SparkRunner.class);
        options.setHdfsConfiguration(Collections.singletonList(hadoopConf));
        Pipeline pipeline = Pipeline.create(options);

        PCollection<String> sourceData = pipeline.apply(TextIO.read().from(args[14]).withDelimiter("\n".getBytes()));
        PCollection<KV<Row, Row>> sourceDataRows = sourceData.apply(ParDo.of(new ExtractFunction()));
        PCollection<String> targetData = pipeline.apply(TextIO.read().from(args[23]).withDelimiter("\n".getBytes()));
        PCollection<KV<Row, Row>> targetDataRows = targetData.apply(ParDo.of(new ExtractFunction()));

        PCollection<KV<Row, CoGbkResult>> kvpCollection = KeyedPCollectionTuple 
                .of(sourceDataInfoTag, sourceDataRows.setCoder(KvCoder.of(RowCoder.of(SOURCE_JOIN_RECORD_TYPE),RowCoder.of(SOURCE_RECORD_TYPE)))) 
                .and(targetDataInfoTag, targetDataRows.setCoder(KvCoder.of(RowCoder.of(TARGET_JOIN_RECORD_TYPE),RowCoder.of(TARGET_RECORD_TYPE)))) 
                .apply(CoGroupByKey.<Row>create()); 

        PCollection<GenericRecord> finalResultCollections = kvpCollection.apply("process",ParDo.of(new DoFn<KV<Row, CoGbkResult>, GenericRecord>() {
            @ProcessElement
            public void processElement(ProcessContext context) {
                KV<Row, CoGbkResult> element = context.element();
                Iterator<Row> srcIter = element.getValue().getAll(sourceDataInfoTag).iterator();
                Iterator<Row> trgIter = element.getValue().getAll(targetDataInfoTag).iterator();
                Metrics.counter("count", "count").inc();

                GenericRecordBuilder builder = new GenericRecordBuilder(SCHEMA);
                boolean done = false;
                boolean captureError = false;
                while (!done)
                {
                    // Some iterator data here (elided).
                    // ...
                    builder.set(colName, data);
                    if(captureError){
                        GenericRecord record = builder.build();
                        context.output(record);
                    }
                }
            }
          })).setCoder(AvroCoder.of(GenericRecord.class, SCHEMA));

        finalResultCollections.apply("writeText",FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(SCHEMA))
                .withSuffix(".parquet")
                .withPrefix("part")
                .to("hdfs://temp/"));


        final PipelineResult result = pipeline.run();
        State state = result.waitUntilFinish();

        MetricQueryResults metrics =
            result
                .metrics()
                .queryMetrics(
                    MetricsFilter.builder()
                        .addNameFilter(MetricNameFilter.inNamespace("count"))
                        .build());
        Iterable<MetricResult<Long>> counters = metrics.getCounters();
        for (MetricResult<Long> counter : counters) {
            System.out.println("Count  >> "+counter.getName().getName() + " : " + counter.getAttempted() + " " + counter.getCommittedOrNull());

        }

    }

Answer 1 (score: 0)

In your code, the counter is defined each time Metrics.counter("count", "errorcount") executes. But it is defined inside a loop, which itself sits inside something that behaves like a loop (processElement). You should define the counter as a field of the DoFn instead; don't worry, the DoFn instance is reused for processing bundles. For example: private final Counter counter = Metrics.counter(MyClass.class, COUNTER_NAME); (see the sketch below). Also, you only showed part of the code, but I don't see the done boolean ever being set to true. That is just out of curiosity, though.
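A minimal sketch of that suggestion, assuming a join DoFn shaped like the one in the question; the JoinFn class name and the constructor wiring of the TupleTag are illustrative, not from the original post:

    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;

    static class JoinFn extends DoFn<KV<String, CoGbkResult>, KV<String, String>> {
      // Created once per DoFn instance rather than on every element or loop
      // iteration; Beam aggregates all increments under the same metric name.
      private final Counter counter = Metrics.counter(JoinFn.class, "errorcount");

      private final TupleTag<String> eventInfoTag;

      JoinFn(TupleTag<String> eventInfoTag) {
        this.eventInfoTag = eventInfoTag;
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        for (String eventInfo : c.element().getValue().getAll(eventInfoTag)) {
          counter.inc(); // the same Counter object is reused across bundles
          // ... emit output exactly as in the original loop ...
        }
      }
    }

The metric query at the end of the pipeline stays unchanged; only the counter creation moves out of the per-element hot path.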

Last but not least, you should try the SparkRunner on Beam's master branch, because a fix related to metrics was merged yesterday (metrics were not being reset when running multiple pipelines inside the same JVM). I don't know whether it matches your use case, but it is worth trying.