I have source and target CSV files, each with 10 million records and 250 columns. I am running an Apache Beam pipeline that joins all the columns of the source and target files. When I run this pipeline on a Spark cluster, it executes correctly and throws no exception, but the Beam metric counter for the join returns double the expected count when I use the following Spark property: executor memory "2g". However, when I increase the executor memory to 11g, it returns the correct count.
I tried the following example:
Pipeline pipeline = Pipeline.create(options);

final TupleTag<String> eventInfoTag = new TupleTag<>();
final TupleTag<String> countryInfoTag = new TupleTag<>();

PCollection<KV<String, String>> eventInfo =
    eventsTable.apply(ParDo.of(new ExtractEventDataFn()));
PCollection<KV<String, String>> countryInfo =
    countryCodes.apply(ParDo.of(new ExtractCountryInfoFn()));

PCollection<KV<String, CoGbkResult>> kvpCollection =
    KeyedPCollectionTuple.of(eventInfoTag, eventInfo)
        .and(countryInfoTag, countryInfo)
        .apply(CoGroupByKey.create());

PCollection<KV<String, String>> finalResultCollection =
    kvpCollection.apply(
        "Process",
        ParDo.of(
            new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                KV<String, CoGbkResult> e = c.element();
                String countryCode = e.getKey();
                String countryName = "none";
                countryName = e.getValue().getOnly(countryInfoTag);
                for (String eventInfo : c.element().getValue().getAll(eventInfoTag)) {
                  Metrics.counter("count", "errorcount").inc();
                  c.output(
                      KV.of(
                          countryCode,
                          "Country name: " + countryName + ", Event info: " + eventInfo));
                }
              }
            }));

final PipelineResult result = pipeline.run();

MetricQueryResults metrics =
    result
        .metrics()
        .queryMetrics(
            MetricsFilter.builder()
                .addNameFilter(MetricNameFilter.inNamespace("count"))
                .build());

Iterable<MetricResult<Long>> counters = metrics.getCounters();
for (MetricResult<Long> counter : counters) {
  System.out.println("Hi >> " + counter.getName().getName() + " : "
      + counter.getAttempted() + " " + counter.getCommittedOrNull());
}
I need help with this. Thanks.
Answer 0 (score: 0)
public static void main(String[] args) {

  Configuration hadoopConf = new Configuration();
  hadoopConf.set("fs.defaultFS", args[13]);
  hadoopConf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
  hadoopConf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

  final TupleTag<Row> sourceDataInfoTag = new TupleTag<Row>(){};
  final TupleTag<Row> targetDataInfoTag = new TupleTag<Row>(){};

  HadoopFileSystemOptions options = PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
  options.setRunner(SparkRunner.class);
  options.setHdfsConfiguration(Collections.singletonList(hadoopConf));

  Pipeline pipeline = Pipeline.create(options);

  PCollection<String> sourceData =
      pipeline.apply(TextIO.read().from(args[14]).withDelimiter("\n".getBytes()));
  PCollection<KV<Row, Row>> sourceDataRows = sourceData.apply(ParDo.of(new ExtractFunction()));

  PCollection<String> targetData =
      pipeline.apply(TextIO.read().from(args[23]).withDelimiter("\n".getBytes()));
  PCollection<KV<Row, Row>> targetDataRows = targetData.apply(ParDo.of(new ExtractFunction()));

  PCollection<KV<Row, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
      .of(sourceDataInfoTag, sourceDataRows.setCoder(
          KvCoder.of(RowCoder.of(SOURCE_JOIN_RECORD_TYPE), RowCoder.of(SOURCE_RECORD_TYPE))))
      .and(targetDataInfoTag, targetDataRows.setCoder(
          KvCoder.of(RowCoder.of(TARGET_JOIN_RECORD_TYPE), RowCoder.of(TARGET_RECORD_TYPE))))
      .apply(CoGroupByKey.<Row>create());

  PCollection<GenericRecord> finalResultCollections = kvpCollection.apply("process",
      ParDo.of(new DoFn<KV<Row, CoGbkResult>, GenericRecord>() {
        @ProcessElement
        public void processElement(ProcessContext context) {
          KV<Row, CoGbkResult> element = context.element();
          Iterator<Row> srcIter = element.getValue().getAll(sourceDataInfoTag).iterator();
          Iterator<Row> trgIter = element.getValue().getAll(targetDataInfoTag).iterator();
          Metrics.counter("count", "count").inc();
          GenericRecordBuilder builder = new GenericRecordBuilder(SCHEMA);
          boolean done = false;
          boolean captureError = false;
          while (!done) {
            // Some iterator data here.
            // ...
            builder.set(colName, data);
            if (captureError) {
              GenericRecord record = builder.build();
              context.output(record);
            }
          }
        }
      })).setCoder(AvroCoder.of(GenericRecord.class, SCHEMA));

  finalResultCollections.apply("writeText", FileIO.<GenericRecord>write()
      .via(ParquetIO.sink(SCHEMA))
      .withSuffix(".parquet")
      .withPrefix("part")
      .to("hdfs://temp/"));

  final PipelineResult result = pipeline.run();
  State state = result.waitUntilFinish();

  MetricQueryResults metrics =
      result
          .metrics()
          .queryMetrics(
              MetricsFilter.builder()
                  .addNameFilter(MetricNameFilter.inNamespace("count"))
                  .build());

  Iterable<MetricResult<Long>> counters = metrics.getCounters();
  for (MetricResult<Long> counter : counters) {
    System.out.println("Count >> " + counter.getName().getName() + " : "
        + counter.getAttempted() + " " + counter.getCommittedOrNull());
  }
}
Answer 1 (score: 0)
In your code, the counter is defined at the moment Metrics.counter("count", "errorcount") is executed. But it is defined inside a loop, and that loop itself runs inside something that also behaves like a loop (processElement). You should instead define the counter as a field of your DoFn; do not worry about the DoFn being reused to process bundles. For example: private final Counter counter = Metrics.counter(MyClass.class, COUNTER_NAME);
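A minimal sketch of that suggestion applied to the join DoFn from the question follows; the class name JoinWithCounterFn and the constructor taking the two TupleTags are illustrative choices, not part of the original code.

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TupleTag;

public class JoinWithCounterFn extends DoFn<KV<String, CoGbkResult>, KV<String, String>> {

  // The counter is declared once, as a field of the DoFn, instead of being
  // created by Metrics.counter(...) on every loop iteration inside processElement.
  private final Counter errorCount = Metrics.counter(JoinWithCounterFn.class, "errorcount");

  private final TupleTag<String> eventInfoTag;
  private final TupleTag<String> countryInfoTag;

  public JoinWithCounterFn(TupleTag<String> eventInfoTag, TupleTag<String> countryInfoTag) {
    this.eventInfoTag = eventInfoTag;
    this.countryInfoTag = countryInfoTag;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    String countryName = c.element().getValue().getOnly(countryInfoTag);
    for (String eventInfo : c.element().getValue().getAll(eventInfoTag)) {
      errorCount.inc();  // only increments the existing counter
      c.output(KV.of(c.element().getKey(),
          "Country name: " + countryName + ", Event info: " + eventInfo));
    }
  }
}

It would then be used as ParDo.of(new JoinWithCounterFn(eventInfoTag, countryInfoTag)), so the counter handle is created once per DoFn instance rather than on every element and iteration.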
Also, you only showed part of your code, but I do not see the done boolean ever being set to true. That is just out of curiosity, though.
Last but not least, you should try the SparkRunner from Beam's master branch, because a fix about metrics (metrics not being reset when several pipelines run in the same JVM) was merged yesterday. I do not know whether it matches your use case, but it is worth a try.